One of my projects at work uses the Python package Celery with Redis to manage executing background tasks. And we ran into some odd behavior that we didn't see explained anywhere else, so I figure I'll capture it here for the next poor soul running into these issues.
First, if you care about this subject, you should read this post over at Instawork which is a good discussion of the risks involved in using countdown
and eta
. It helps set the stage.
Setup
We're using Celery with a Redis broker as part of a Django application. We apply one of 3 priorities to each of our tasks: Low, Medium, and High. High-priority tasks represent things that a human user is waiting on and need to be completed as soon as possible. Low-priority tasks are things that need to happen eventually, but we don't really care when. And anything else gets configured as medium priority.
This set up worked in our validation testing. We saw the queues get loaded up in Redis and the workers execute tasks in priority order as expected.
The Wrong Queue
After a large-scale data-processing task we noticed that high-priority user tasks were not executing.
When I inspected the queues in Redis I found that the high-priority queue was full of low-priority tasks. So the workers were extremely busy (correctly) processing the queue, but the tasks they were running were low priority. And the human's task was stuck behind them all.
How did this happen?
Countdown/ETA Reservations
The first part of the puzzle is how Celery handles countdown
/ eta
tasks. countdown
allows you to say "execute the task 5 minutes from now" while eta
allows you to say "execute the task no earlier than March 10, 2023 at 10:08AM."
countdown
is purely syntactic sugar for eta
so that you don't have to calculate actual times yourself, so when you call apply_async
with a countdown
parameter Celery converts it to an eta
parameter. Since internally Celery only concerns itself with eta
values we'll only talk in terms of eta
from this point on.
When an eta
task is generated it gets put into the appropriate priority queue. But, it doesn't stay in the queue until its eta
passes. Instead, any worker checking the queue will reserve the task immediately and hold it internally until the eta
passes.
During my investigation, while the workers were idle, I scheduled a few hundred tasks with an eta
and, as a result of the above behavior, the priority queues in Redis were empty. Workers will continue reserving eta
tasks from queues until they have a task that needs to actually execute now. Once the workers are busy, eta
tasks will stay in their appropriate queues until a worker is freed up and comes looking again.
Processing Reserved Tasks
Alright, so our workers have reserved all of our eta
tasks with varying priorities and now the tasks are starting to pass their eta
s and need to be executed. At this point the worker completely ignores the priorities on the tasks. It begins executing whichever reserved task it happens upon first in its internal data structure (this is probably an internal queue, but I don't know for sure).
So once a worker reserves a task its priority is no longer respected. If you schedule a few hundred eta
tasks with mixed priorities (as I did) you see them executed in what appears to be an arbitrary order (I suspect they're actually executed in order they were reserved, but I haven't verified that because it's not relevant to my concerns).
This is not good and reason enough to avoid using eta
tasks for anything but high-priority tasks. But, it doesn't explain how we ended up with low-priority tasks in our high-priority queue in Redis.
Death of a Celery Worker
We have Celery configured to replace each worker after completing 10 tasks. This was an attempt to work around an issue where workers would stop pulling tasks from the queue and everything would stall out. We had a hypothesis that the issue was unclosed connections to Redis and so replacing the workers would force unclosed connections to get cleaned up. We haven't yet verified what was actually happening or if replacing the workers fixed anything though. It's a very intermittent problem and we haven't identified a sure trigger. (Though we did solidly identify that if Redis isn't ready to serve connection when Celery starts then Celery will not reconnect properly and workers will only execute a single task before hanging forever.)
Anyway, the point is that we replace our workers every so often. Well, what happens to all those eta
tasks the worker had reserved? They go back in the queue so another worker can get them. But, they all go into the default queue instead of going back to their appropriate priority queues. It happens that the default queue is the high-priority queue. So each time a worker was getting cycled, all the eta
tasks it held were pushed into the high-priority queue.
The perfect storm
So here's the scenario. Our workers are sitting idle waiting for work to do. Our large-scale data-processing task schedules a bunch of low-priority jobs with an eta
. The workers eagerly snap up all these tasks and reserve them for future execution. As soon as the earliest eta
s pass each worker begins executing and stops reserving more eta
tasks from the low-priority queue.
Each worker completes 10 tasks and gets replaced. As each worker is replaced it returns the remainder of its eagerly-reserved eta
tasks to the high-priority queue. The new workers being spawned now begin processing the high-priority queue since it's full of tasks.
A user comes along and engages in an action backed by a high-priority task. But the user's high-priority task is now stuck behind several thousand low-priority tasks that have been misplaced in the high-priority queue.
Moral of the story
We had run across the countdown
parameter to apply_async
and thought it would be a good way to avoid some unnecessary work by pairing some high-churn jobs with a flag so they'd only be scheduled again if they weren't already scheduled (this flag was managed outside of the Celery world).
We will be rolling back that change so as to avoid this situation in the future.