I've spent today looking into lockless queues. I have a multiple producer, multiple consumer situation. I implemented, for testing, a system using the Interlocked SList thing under Win32 and it doubled the performance of my heavily threaded task based code. Unfortunately, though, I wish to support multiple platforms. Interlocking on multiple platforms itself is not a problem and I can safely assume I can interlock without problems. However the actual implementation loses me.
The big problem appears to be that you need to guarantee a list push/pop will use only one Interlocked call. Otherwise you are leaving space for another thread to nip in and screw things up. I'm unsure how the microsoft implementation works under-the-hood and would love to know more.
Can anyone point me towards useful information (Platform and language are pretty irrelevant)?
Added to that I'd love to know if its possible to implement a lockless vector. That would have huge amounts of use for me :) Cheers!
Edit: Having read herb's DDJ article I can see a reduced lock queue that is pretty similar to the one I already had. However I notice that there are papers at the end that can do the true lockless queue'ing using a double compare-and-swap (DCAS) operation. Has anyone implemented a queue using cmpxchg8b (or cmpxchg16b for that matter)?
I'm just musing at this point (having not read the papers) but you could use this system to update the head and tail pointer simultaneously and thus avoid any issues with another thread jumping inbetween the 2 atomic operations. However you still need to get the next head pointer to test that against the tail pointer to see if yo have just modified the tail. How do you avoid another thread changing this information while the other thread is preparing to do so itself? How exactly is this implemented in a lockless way? Or am i better off reading the undecipherability that is a research paper? ;)
You could probably implement a limited-size queue with least difficulty... I was thinking about it lately and came up with this design, but you can probably find many other interesting ideas: (WARNING: it might have some issues!)
head
== tail
, there are no itemsenqueue(ptr)
, Interlocked-Swap tail
with NULL (prev_tail
is the swapped value)
prev_tail == NULL
, try againprev_tail + 1
(with wraparound) == head
, your queue is fullptr
in *prev_tail
and assign prev_tail+1
to tail
(watch out for buffer wrap-around)dequeue()
make a copy tmp_head=head and check tmp_head == tail
*tmp_head
as ptr
head
with tmp_head
swap head
with head+1
ptr
You can wait on both head and tail CAS operations, but if the queue is not contended, you should succeed the first time, without unnecessary locks.
Unlimited-size queues are "a bit" harder ;) But you should be able to create a big-enough queue for most needs.