I'm having trouble understanding the purpose of 'distributed task queues'. For example, python's celery library.
I know that in celery, the python framework, you can set timed windows for functions to get executed. However, that can also be easily done in a linux crontab directed at a python script.
And as far as I know, and shown from my own django-celery webapps, celery consumes much more RAM memory than just setting up a raw crontab. Few hundred MB difference for a relatively small app.
Can someone please help me with this distinction? Perhaps a high level explanation of how task queues / crontabs work in general would be nice also.
Thank you.
It depends what you want your tasks to be doing, if you need to distribute them, and how you want to manage them.
A crontab is capable of executing a script every N intervals. It runs, and then returns. Essentially you get a single execution each interval. You could just direct a crontab to execute a django management command and get access to the entire django environment, so celery doesn't really help you there.
What celery brings to the table, with the help of a message queue, is distributed tasks. Many servers can join the pool of workers and each receive a work item without fear of double handling. It's also possible to execute a task as soon as it is ready. With cron, you're limited to a minimum of one minute.
As an example, imagine you've just launched a new web application and you're receiving hundreds of sign ups that require an email to be sent to each user. Sending an email may take a long time (comparatively) so you decide that you'll handle activation emails via tasks.
If you were using cron, you'd need to ensure that every minute cron is able to process all of the emails that need to be sent. If you have several servers you now need to make sure that you aren't sending multiple activation emails to the same user - you need some kind of synchronization.
With celery, you add a task to the queue. You may have several workers per server so you've already scaled ahead of a cronjob. You may also have several servers allowing you to scale even more. Synchronization is handled as part of the 'queue'.
You can use celery as a cron replacement but that's not really its primary use. It is used for farming out asynchronous tasks across a distributed cluster.
And of course, celery has a big list of features that cron does not.