Best practices for multithreaded processing of database records

Mike Sickler · Feb 18, 2009

I have a single process that queries a table for records where PROCESS_IND = 'N', does some processing, and then updates the PROCESS_IND to 'Y'.

I'd like to allow for multiple instances of this process to run, but don't know what the best practices are for avoiding concurrency problems.

Where should I start?

Answer

MarkR · Feb 20, 2009

The pattern I'd use is as follows:

  • Create columns "lockedby" and "locktime" which are a thread/process/machine ID and timestamp respectively (you'll need the machine ID when you split the processing between several machines)
  • Each task would do a query such as:

    UPDATE taskstable SET lockedby=(my id), locktime=now() WHERE lockedby IS NULL ORDER BY ID LIMIT 10

Where 10 is the "batch size".

  • Then each task does a SELECT to find out which rows it has "locked" for processing, and processes those
  • After each row is complete, you set lockedby and locktime back to NULL
  • All this is done in a loop for as many batches as exist.
  • A cron job or scheduled task periodically resets the "lockedby" of any row whose locktime is too long ago, since such rows were presumably claimed by a task that hung or crashed. Another worker will then pick them up
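The stale-lock reset in the last bullet might look something like the following sketch. The table and column names (taskstable, lockedby, locktime) follow the answer; the timeout value, the use of SQLite, and storing locktime as a Unix timestamp are assumptions for the sake of a self-contained example.

```python
import sqlite3
import time

LOCK_TIMEOUT_SECS = 600  # assumed: how long before a lock is considered stale

def reap_stale_locks(conn: sqlite3.Connection) -> int:
    """Release locks older than the timeout so crashed workers' rows
    can be picked up again. Returns the number of rows released."""
    cutoff = time.time() - LOCK_TIMEOUT_SECS
    cur = conn.execute(
        "UPDATE taskstable SET lockedby = NULL, locktime = NULL "
        "WHERE lockedby IS NOT NULL AND locktime < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Run from cron (or any scheduler), this is idempotent: resetting a row only clears its lock, so the worst case after a crash is that the row is processed again by another worker.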

The LIMIT 10 is MySQL-specific, but other databases have equivalents. The ORDER BY is important to avoid the query being nondeterministic.
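The whole claim/select/process/release loop above can be sketched as follows. SQLite is used so the example is self-contained; since SQLite lacks UPDATE ... ORDER BY ... LIMIT, the batch is claimed through a subquery, which is one of the "equivalents" other databases offer. The schema (taskstable with id, process_ind, lockedby, locktime columns) and the process_row callback are assumptions, not part of the original answer.

```python
import sqlite3
import time
import uuid

BATCH_SIZE = 10  # the "batch size" from the answer

def claim_batch(conn: sqlite3.Connection, worker_id: str) -> list:
    """Atomically stamp a batch of unlocked rows with our id,
    then SELECT back the rows we actually locked."""
    conn.execute(
        "UPDATE taskstable SET lockedby = ?, locktime = ? "
        "WHERE id IN (SELECT id FROM taskstable "
        "             WHERE lockedby IS NULL AND process_ind = 'N' "
        "             ORDER BY id LIMIT ?)",
        (worker_id, time.time(), BATCH_SIZE),
    )
    conn.commit()
    return conn.execute(
        "SELECT id FROM taskstable WHERE lockedby = ?", (worker_id,)
    ).fetchall()

def run_worker(conn: sqlite3.Connection, process_row) -> None:
    # Unique per thread/process/machine, as the answer requires.
    worker_id = str(uuid.uuid4())
    while True:
        rows = claim_batch(conn, worker_id)
        if not rows:
            break  # no unlocked work left
        for (row_id,) in rows:
            process_row(row_id)
            # Mark done and release the lock on this row.
            conn.execute(
                "UPDATE taskstable SET process_ind = 'Y', "
                "lockedby = NULL, locktime = NULL WHERE id = ?",
                (row_id,),
            )
            conn.commit()
```

Because each worker only ever SELECTs rows stamped with its own id, two instances running concurrently never process the same row, provided the claiming UPDATE is atomic (which a single UPDATE statement is in both MySQL and SQLite).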