
tl;dr
In this post, I’ll describe how our team diagnosed and resolved a serious bottleneck when our Oban job queue grew to over 20 million jobs in the available state—and why that number kept growing instead of shrinking.
What is Oban?
Oban is a job processing library for Elixir. It uses PostgreSQL as a persistent backend (with recent support for MySQL and SQLite). Oban is available in three tiers:
- Oban (Free) – the open-source core with robust job processing capabilities
- Oban Pro – paid edition with batch jobs, workflows, and support
- Oban Enterprise – for large teams with advanced performance and governance needs
Our team uses Oban Pro, primarily for its powerful chunk-based batch processing and access to community support.
Why Oban?
You might ask, “Why not just use a GenServer and ETS?” If you’re building a side project with no users, that might work. But for production workloads handling millions of jobs, you need:
- Transactional safety – so jobs aren’t lost due to crashes or deploys
- Persistence and observability – all job states are stored in the database
- Structured retries – with exponential backoff and error tracking
Oban gives us these out of the box.
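To make that concrete, here is a minimal sketch of a plain (free-tier) Oban worker; the module, queue, and MyApp.Messages.process/1 are made-up names for this post. Oban retries any failing attempt with exponential backoff until max_attempts is exhausted:

defmodule MyApp.Workers.ProcessMessage do
  use Oban.Worker, queue: :messages, max_attempts: 20

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"message_id" => id}}) do
    # A raised exception or an {:error, reason} return marks this attempt as
    # failed; Oban records the error and schedules the next attempt with backoff.
    MyApp.Messages.process(id)
  end
end

Enqueuing is just %{message_id: 123} |> MyApp.Workers.ProcessMessage.new() |> Oban.insert(), and because that insert is an ordinary database write, it can run inside the same transaction as your other changes. That is where the transactional safety comes from.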
Architecture Overview
Here’s the architecture we use:
- Our service receives messages from a central messaging microservice
- For each message, we enqueue an Oban job with minimal logic upfront
- Jobs are split across different queues, based on the business domain
- We use ChunkWorker from Oban Pro to process jobs in batches (up to 1000)
- Failed jobs are retried up to 20 times
- Our Oban workers store processed results in ClickHouse
- We use AWS RDS PostgreSQL as the backing DB
This gives us loose coupling: the central service doesn’t need to wait on us. If our logic fails, the job stays in Oban and can be retried later.
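For illustration, here is roughly what one of our chunk workers looks like. This is a hedged sketch: Oban Pro’s Chunk worker options and callbacks can vary between versions, and MyApp.ClickHouse.insert_all/2 plus the :events queue are placeholder names rather than our real code.

defmodule MyApp.Workers.EventChunk do
  # Collects up to 1000 jobs (or waits up to 10 seconds) and processes them together.
  use Oban.Pro.Workers.Chunk, queue: :events, size: 1000, timeout: 10_000

  def process(jobs) do
    # jobs is a list of Oban.Job structs; one batched write instead of 1000 single inserts.
    rows = Enum.map(jobs, fn job -> job.args end)
    MyApp.ClickHouse.insert_all("events", rows)
    :ok
  end
end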
What Went Wrong?
Our service depended on data from another microservice. That upstream service introduced a bug, causing our business logic to fail. Jobs were retried 20 times but kept failing.
At the same time, we continued receiving new jobs, which created a growing backlog. Eventually, we saw this familiar Ecto error:
(DBConnection.ConnectionError) client ... timed out because it queued and checked out the connection for longer than 15000ms
Oban workers couldn’t claim or process jobs fast enough. As a result, available jobs began piling up, reaching 20+ million.
The Resolution
To be honest, we weren’t initially clear on what available meant (hint: it’s not good). It turns out:
available means the job is ready to be processed, but no worker has claimed it yet.
After reaching out on the Elixir Forum, Oban maintainer Parker Selbert replied in under 4 hours (great support!). We took the following steps:
✅ Step-by-step:
- Fix the bug in our business logic that caused job failures.
- Upgrade Oban Pro and Oban Web to the latest versions.
- Apply Oban’s scaling recommendations for clustered environments (we use libcluster):
config :my_app, Oban,
  notifier: Oban.Notifiers.PG,
  insert_trigger: false
- Temporarily stop incoming messages from the central service.
- Reindex the oban_jobs table and run VACUUM ANALYZE:
REINDEX TABLE oban_jobs;
VACUUM ANALYZE oban_jobs;
- Move some available jobs to the scheduled state to give the system breathing room:
WITH jobs_to_update AS (
  SELECT id
  FROM oban_jobs
  WHERE state = 'available'
  LIMIT 100000
)
UPDATE oban_jobs
SET state = 'scheduled',
    scheduled_at = now() + INTERVAL '1 hour'
WHERE id IN (SELECT id FROM jobs_to_update);
- Once the job count began to decrease, we resumed incoming traffic.
- With Oban Web’s batch control interface and the latest scaling config, we monitored and manually moved jobs in chunks of 1000 as needed.
Lessons Learned (Bonus Tips)
- Always version your job arguments: When you change the shape of a job’s args, keep supporting the old format. Add a version field to the args and pattern match on it in the worker (see the sketch after this list).
- Don’t forget the queue config: Defining a worker module isn’t enough—you must also declare its queue in your app config. Otherwise, the jobs silently remain in the available state.
- Monitor proactively: Use oban_met or a cron job to track job state counts and alert when thresholds are breached (a minimal query sketch follows below).
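To make the first tip concrete, here is a small sketch of versioned args with pattern matching; the worker name, arg fields, and MyApp.Accounts.sync/2 are invented for illustration:

defmodule MyApp.Workers.SyncAccount do
  use Oban.Worker, queue: :accounts

  @impl Oban.Worker
  # New jobs carry an explicit version in their args.
  def perform(%Oban.Job{args: %{"version" => 2, "account_id" => id, "region" => region}}) do
    MyApp.Accounts.sync(id, region)
  end

  # Old jobs, enqueued before the schema change, still match and keep working.
  def perform(%Oban.Job{args: %{"account_id" => id}}) do
    MyApp.Accounts.sync(id, "default-region")
  end
end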
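And for the monitoring tip, a periodic check can be as small as one Ecto query over oban_jobs; the threshold and MyApp.Alerts.notify/1 are placeholders:

import Ecto.Query

# Count jobs per state; Oban.Job is the Ecto schema Oban ships for the oban_jobs table.
counts =
  MyApp.Repo.all(
    from j in Oban.Job,
      group_by: j.state,
      select: {j.state, count(j.id)}
  )

# Alert when the available backlog crosses a threshold.
for {"available", count} <- counts, count > 1_000_000 do
  MyApp.Alerts.notify("Oban backlog: #{count} available jobs")
end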
Shoutout: Shannon and Parker
Oban is built by Shannon and Parker Selbert—a husband and wife team. They gave a fantastic talk at DORS/CLUC 2025 in Zagreb about how they turned an open-source project into a business.
Fun fact: they lived in Croatia back in 2014 (working on Ruby, not Oban yet). I had a chance to chat with them briefly and thank them for their support. Also learned their younger son is a speedcuber—so we had a short cube geek-out too!
Final Thoughts
Job backlogs happen. What matters is how fast you detect, debug, and fix the cause. Oban gave us the tools we needed: persistence, visibility, and control.
If you’re dealing with high-scale workloads, don’t treat a large and growing available count as a healthy state. It’s a red flag.
Got questions or ran into similar issues? Ping me on Elixir Forum or LinkedIn. I’d love to hear how you handled your Oban edge cases!