views:

2232

answers:

5

Hi,

I've spent today looking into lockless queues. I have a multiple producer, multiple consumer situation. I implemented, for testing, a system using the Interlocked SList thing under Win32 and it doubled the performance of my heavily threaded task based code. Unfortunately, though, I wish to support multiple platforms. Interlocking on multiple platforms itself is not a problem and I can safely assume I can interlock without problems. However the actual implementation loses me.

The big problem appears to be that you need to guarantee a list push/pop will use only one Interlocked call. Otherwise you are leaving space for another thread to nip in and screw things up. I'm unsure how the microsoft implementation works under-the-hood and would love to know more.

Can anyone point me towards useful information (Platform and language are pretty irrelevant)?

Added to that I'd love to know if its possible to implement a lockless vector. That would have huge amounts of use for me :) Cheers!

Edit: Having read herb's DDJ article I can see a reduced lock queue that is pretty similar to the one I already had. However I notice that there are papers at the end that can do the true lockless queue'ing using a double compare-and-swap (DCAS) operation. Has anyone implemented a queue using cmpxchg8b (or cmpxchg16b for that matter)?

I'm just musing at this point (having not read the papers) but you could use this system to update the head and tail pointer simultaneously and thus avoid any issues with another thread jumping inbetween the 2 atomic operations. However you still need to get the next head pointer to test that against the tail pointer to see if yo have just modified the tail. How do you avoid another thread changing this information while the other thread is preparing to do so itself? How exactly is this implemented in a lockless way? Or am i better off reading the undecipherability that is a research paper? ;)

+1  A: 

These guys have, maybe you could find some inspiration there. The other interesting files are yqueue.hpp and atomic_ptr.hpp

nos
+2  A: 

I think there's some interesting discussion on this topic here, particularly this thread.

MikeB
+3  A: 

You could probably implement a limited-size queue with least difficulty... I was thinking about it lately and came up with this design, but you can probably find many other interesting ideas: (WARNING: it might have some issues!)

  • the queue is an array of pointers to items
  • you have to manage 2 pointers (head, tail) which work over the queue in the same way as a circular buffer
  • if head == tail, there are no items
  • if you want to enqueue(ptr), Interlocked-Swap tail with NULL (prev_tail is the swapped value)
    • if prev_tail == NULL, try again
    • if prev_tail + 1 (with wraparound) == head, your queue is full
    • otherwise put your ptr in *prev_tail and assign prev_tail+1 to tail (watch out for buffer wrap-around)
  • for dequeue() make a copy tmp_head=head and check tmp_head == tail
    • if it's true, return because the queue is empty
    • if it's false
      • save *tmp_head as ptr
      • do a CAS: compare head with tmp_head swap head with head+1
      • if CAS failed -- start the whole function over
      • if it succeeded -- return ptr

You can wait on both head and tail CAS operations, but if the queue is not contended, you should succeed the first time, without unnecessary locks.

Unlimited-size queues are "a bit" harder ;) But you should be able to create a big-enough queue for most needs.

viraptor
How are you going to check if `head == tail` on dequeue without a lock?
Doug McClean
Ok - I didn't describe that step properly. You have to loop to the beginning, instead of around the CAS itself. CAS will detect if the head was moved.
viraptor
Sounds good, I agree with the expanded explanation. Just checking :).
Doug McClean
Essentially what you are describing is a spinlock on the queue isn't it? Why not just use a spin lock based critical section? Performance will be the same won't it? As for saying if the queue is not contended all will be good ... if its not contended then you may as well use a critical section ;)
Goz
Almost - with spinlock, you actually have to lock in a different place, which makes the whole thing a couple of operations longer. I don't know what the gain is, but if you care enough to look for a lockless solution, then it's hard to get to a faster version than one assignment + one cas.
viraptor
+1  A: 

viraptor solution is locking , there are no multiple producer/multiple consumer lockless queue algotihms im aware of .

ben
I would argue it is lockless - it does not contain mutexes. It does require synchronisation, but so does anything that is thread-safe. You cannot make anything threadsafe without the possibility of retrying (either the operation itself, or retrying the mutex lock).
viraptor
Sorry i didnt check close enough your solution is the typical single producer /consumer Cas queue i thought it was a spinlock. However to do MULIPLE CONSUMER AND MULTIPLE PRODUCER there have been no scientific papers regarding a CAS solution they all fail.
ben
+1  A: 

You might want to take a look at Herb Sutters implementation of a low lock queue.

http://www.drdobbs.com/hpc-high-performance-computing/211601363

It does use the c++0x atomics but it would be (should be) easy to implement with your specific architectures atomic ops (_sync* using GNU, atomic_* on solaris etc).

ScaryAardvark