How do I build a lockless queue?

Goz picture Goz · Nov 12, 2009 · Viewed 16.8k times · Source

I've spent today looking into lockless queues. I have a multiple producer, multiple consumer situation. I implemented, for testing, a system using the Interlocked SList thing under Win32 and it doubled the performance of my heavily threaded task based code. Unfortunately, though, I wish to support multiple platforms. Interlocking on multiple platforms itself is not a problem and I can safely assume I can interlock without problems. However the actual implementation loses me.

The big problem appears to be that you need to guarantee a list push/pop will use only one Interlocked call. Otherwise you are leaving space for another thread to nip in and screw things up. I'm unsure how the microsoft implementation works under-the-hood and would love to know more.

Can anyone point me towards useful information (Platform and language are pretty irrelevant)?

Added to that I'd love to know if its possible to implement a lockless vector. That would have huge amounts of use for me :) Cheers!

Edit: Having read herb's DDJ article I can see a reduced lock queue that is pretty similar to the one I already had. However I notice that there are papers at the end that can do the true lockless queue'ing using a double compare-and-swap (DCAS) operation. Has anyone implemented a queue using cmpxchg8b (or cmpxchg16b for that matter)?

I'm just musing at this point (having not read the papers) but you could use this system to update the head and tail pointer simultaneously and thus avoid any issues with another thread jumping inbetween the 2 atomic operations. However you still need to get the next head pointer to test that against the tail pointer to see if yo have just modified the tail. How do you avoid another thread changing this information while the other thread is preparing to do so itself? How exactly is this implemented in a lockless way? Or am i better off reading the undecipherability that is a research paper? ;)

Answer

viraptor picture viraptor · Nov 13, 2009

You could probably implement a limited-size queue with least difficulty... I was thinking about it lately and came up with this design, but you can probably find many other interesting ideas: (WARNING: it might have some issues!)

  • the queue is an array of pointers to items
  • you have to manage 2 pointers (head, tail) which work over the queue in the same way as a circular buffer
  • if head == tail, there are no items
  • if you want to enqueue(ptr), Interlocked-Swap tail with NULL (prev_tail is the swapped value)
    • if prev_tail == NULL, try again
    • if prev_tail + 1 (with wraparound) == head, your queue is full
    • otherwise put your ptr in *prev_tail and assign prev_tail+1 to tail (watch out for buffer wrap-around)
  • for dequeue() make a copy tmp_head=head and check tmp_head == tail
    • if it's true, return because the queue is empty
    • if it's false
      • save *tmp_head as ptr
      • do a CAS: compare head with tmp_head swap head with head+1
      • if CAS failed -- start the whole function over
      • if it succeeded -- return ptr

You can wait on both head and tail CAS operations, but if the queue is not contended, you should succeed the first time, without unnecessary locks.

Unlimited-size queues are "a bit" harder ;) But you should be able to create a big-enough queue for most needs.