Bull Queue Concurrency Questions

Brian picture Brian · Jan 10, 2019 · Viewed 18.8k times · Source

I need help understanding how Bull Queue (bull.js) processes concurrent jobs.

Suppose I have 10 Node.js instances that each instantiate a Bull Queue connected to the same Redis instance:

const bullQueue = require('bull');
const queue = new bullQueue('taskqueue', {...})
const concurrency = 5;
queue.process('jobTypeA', concurrency, job => {...do something...});

Does this mean that globally across all 10 node instances there will be a maximum of 5 (concurrency) concurrently running jobs of type jobTypeA? Or am I misunderstanding and the concurrency setting is per-Node instance?

What happens if one Node instance specifies a different concurrency value?

Can I be certain that jobs will not be processed by more than one Node instance?

Answer

Stas Korzovsky picture Stas Korzovsky · May 12, 2019

The TL;DR is: under normal conditions, jobs are being processed only once. If things go wrong (say Node.js process crashes), jobs may be double processed.

Quoting from Bull's official README.md:

Important Notes

The queue aims for an "at least once" working strategy. This means that in some situations, a job could be processed more than once. This mostly happens when a worker fails to keep a lock for a given job during the total duration of the processing.

When a worker is processing a job it will keep the job "locked" so other workers can't process it.

It's important to understand how locking works to prevent your jobs from losing their lock - becoming stalled - and being restarted as a result. Locking is implemented internally by creating a lock for lockDuration on interval lockRenewTime (which is usually half lockDuration). If lockDuration elapses before the lock can be renewed, the job will be considered stalled and is automatically restarted; it will be double processed. This can happen when:

  1. The Node process running your job processor unexpectedly terminates.
  2. Your job processor was too CPU-intensive and stalled the Node event loop, and as a result, Bull couldn't renew the job lock (see #488 for how we might better detect this). You can fix this by breaking your job processor into smaller parts so that no single part can block the Node event loop. Alternatively, you can pass a larger value for the lockDuration setting (with the tradeoff being that it will take longer to recognize a real stalled job).

As such, you should always listen for the stalled event and log this to your error monitoring system, as this means your jobs are likely getting double-processed.

As a safeguard so problematic jobs won't get restarted indefinitely (e.g. if the job processor aways crashes its Node process), jobs will be recovered from a stalled state a maximum of maxStalledCount times (default: 1).