views: 1459
answers: 8

I'm writing a program with a consumer thread and a producer thread, and queue synchronization now seems to be a big overhead in the program. I looked for lock-free queue implementations, but only found Lamport's version and an improved version from PPoPP '08:

#include <errno.h>

#define QUEUE_SIZE 1024                  /* capacity of the ring buffer    */
#define NEXT(i)    (((i) + 1) % QUEUE_SIZE)

void *buffer[QUEUE_SIZE];                /* NULL marks an empty slot       */
unsigned head;                           /* written only by the producer   */
unsigned tail;                           /* written only by the consumer   */

int enqueue_nonblock(void *data) {
    if (NULL != buffer[head]) {          /* slot still occupied: full      */
        return EWOULDBLOCK;
    }
    buffer[head] = data;                 /* store the element first...     */
    head = NEXT(head);                   /* ...then advance the head index */
    return 0;
}

int dequeue_nonblock(void **data) {
    *data = buffer[tail];
    if (NULL == *data) {                 /* empty slot: queue is empty     */
        return EWOULDBLOCK;
    }
    buffer[tail] = NULL;                 /* release the slot first...      */
    tail = NEXT(tail);                   /* ...then advance the tail index */
    return 0;
}

Both versions require a pre-allocated array for the data. My question is: is there any single-consumer, single-producer lock-free queue implementation that uses malloc() to allocate space dynamically?

And a related question: how can I measure the exact overhead of queue synchronization, such as how much time a call to pthread_mutex_lock() takes?

+3  A: 

The algorithm you show manages to work because, although the two threads share the resource (i.e., the queue), they share it in a very particular way. Because only one thread ever alters the head-index of the queue (the producer), and only one thread ever alters the tail-index (the consumer, of course), you can't get an inconsistent state of the shared object. It's also important that the producer put the actual data in before updating the head index, and that the consumer reads the data it wants before updating the tail index.

It works as well as it does because the array is quite static; both threads can count on the storage for the elements being there. You probably can't replace the array entirely, but what you can do is change what the array is used for.

I.e., instead of keeping the data in the array, use it to keep pointers to the data. Then you can malloc() and free() the data items, while passing references (pointers) to them between your threads via the array.
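
For instance, a minimal sketch of that idea, assuming the enqueue_nonblock()/dequeue_nonblock() functions from the question (with void * elements); struct item and the two step functions are made up for the example:

#include <errno.h>
#include <stdlib.h>

struct item { int value; };                    /* hypothetical payload type */

void producer_step(int v) {
    struct item *it = malloc(sizeof *it);      /* producer allocates the item */
    it->value = v;
    while (enqueue_nonblock(it) == EWOULDBLOCK)
        ;                                      /* queue full: retry (or back off) */
}

int consumer_step(void) {
    void *p;
    while (dequeue_nonblock(&p) == EWOULDBLOCK)
        ;                                      /* queue empty: retry (or back off) */
    int v = ((struct item *)p)->value;
    free(p);                                   /* consumer frees the item */
    return v;
}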

Also, POSIX does support reading a nanosecond-resolution clock, although the actual precision is system dependent. You can read this high-resolution clock before and after and just subtract.
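
For example, a rough sketch that times a loop of uncontended pthread_mutex_lock()/pthread_mutex_unlock() pairs with clock_gettime(CLOCK_MONOTONIC) and averages the result; the iteration count is arbitrary (link with -lpthread, and with -lrt on older systems):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    struct timespec start, end;
    const long N = 1000000;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < N; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    /* elapsed time in nanoseconds, averaged per lock/unlock pair */
    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("avg lock+unlock: %.1f ns\n", ns / N);
    return 0;
}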

JustJeff
Surely that algorithm needs some memory barriers added?
bdonlan
Yes. He says that "It's also important that the producer put the actual data in before updating the head index, and that the consumer reads the data it wants before updating the tail index."
ben
@bdonlan: (et al) not so. it's totally predicated on the order of operations and the fact of a single producer, single consumer. under those circumstances it's fine.
JustJeff
Only on machines where writes are not reordered (which is all of them, AFAIK) AND, more importantly, on machines where writes are not moved ahead of reads. Note that the compiler's reordering is not as important as the CPU's reordering.
ben
@ben: I don't believe reordering can swap operations more than a few machine instructions apart. It would depend on the cpu I guess. Anyway, the implementation virtually always consists of accessing the shared buffer, doing an index recalc (H or T), then updating the index. It could be that the index recalc puts enough other machine instructions between the two critical memory accesses to effectively suppress the reordering effect.
JustJeff
+4  A: 

If you are worried about performance, adding malloc() to the mix won't help things. And if you are not worried about performance, why not simply control access to the queue via a mutex? Have you actually measured the performance of such an implementation? It sounds to me as though you are going down the familiar route of premature optimisation.

anon
+1  A: 

I recall seeing one that looked interesting a few years ago, though I can't seem to find it now. :( The lock-free implementation that was proposed did require use of a CAS primitive, though even the locking implementation (if you didn't want to use the CAS primitive) had pretty good performance characteristics: the locks only prevented multiple readers or multiple producers from hitting the queue at the same time, and the producer still never raced with the consumer.

I do remember that the fundamental concept behind the queue was to create a linked list that always had one extra "empty" node in it. This extra node meant that the head and tail pointers of the list would only ever refer to the same data when the list was empty. I wish I could find the paper; I'm not doing the algorithm justice with my explanation...

AH-ha!

I've found someone who transcribed the algorithm without the remainder of the article. This could be a useful starting point.
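
In the meantime, here is a rough sketch of the dummy-node idea described above for the single-producer/single-consumer case, with malloc() per element. It is not the transcribed pseudocode; the names are illustrative, and on weakly ordered CPUs the producer's two stores would still need a memory barrier between them:

#include <stdlib.h>

struct spsc_node {
    void *data;
    struct spsc_node *next;
};

struct spsc_queue {
    struct spsc_node *head;        /* consumer reads from head->next */
    struct spsc_node *tail;        /* producer appends at tail */
};

void spsc_init(struct spsc_queue *q) {
    struct spsc_node *dummy = malloc(sizeof *dummy);
    dummy->next = NULL;
    q->head = q->tail = dummy;     /* head == tail means empty */
}

void spsc_enqueue(struct spsc_queue *q, void *data) {
    struct spsc_node *n = malloc(sizeof *n);
    n->data = data;
    n->next = NULL;
    q->tail->next = n;             /* link the new node first... */
    q->tail = n;                   /* ...then publish it */
}

int spsc_dequeue(struct spsc_queue *q, void **data) {
    struct spsc_node *first = q->head->next;
    if (first == NULL)
        return -1;                 /* queue empty */
    *data = first->data;
    free(q->head);                 /* release the old dummy */
    q->head = first;               /* dequeued node becomes the new dummy */
    return 0;
}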

Greg D
And most notably, read the fine print in that URL (search for "powerpc") and keep it in mind when you start inventing your own lock-free structures.
Marco van de Voort
The description you give is of Michael and Scott's work, and I see from the link in the above comment that it is indeed this work; the pseudocode there is taken directly from the paper. The idea of the dummy node actually came from Valois.
Blank Xavier
+2  A: 

I've worked with a fairly simple queue implementation that meets most of your criteria. It used a pool of bytes with a static maximum size, and then we implemented messages within that. There was a head pointer that one process would move, and a tail pointer that the other process would move.

Locks were still required, but we used Peterson's 2-Processor Algorithm, which is pretty lightweight since it doesn't involve system calls. The lock is only required for a very small, well-bounded region: a few CPU cycles at most, so you never block for long.
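
For reference, a minimal sketch of Peterson's 2-processor lock; the variable names are illustrative, and on modern compilers/CPUs it additionally needs memory barriers (or C11 atomics) to be correct, so treat it as the idea rather than production code:

static volatile int interested[2] = {0, 0};
static volatile int turn = 0;

void peterson_lock(int self) {          /* self is 0 or 1 */
    int other = 1 - self;
    interested[self] = 1;               /* announce interest */
    turn = other;                       /* yield priority to the other thread */
    while (interested[other] && turn == other)
        ;                               /* spin: a few cycles if uncontended */
}

void peterson_unlock(int self) {
    interested[self] = 0;               /* drop interest, releasing the lock */
}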

Chris Arguin
+1  A: 

I think the allocator can be a performance problem. You can try to use a custom multithreaded memory allocator that uses a linked list for maintaining freed blocks. If your blocks are not (nearly) the same size, you can implement a "buddy system" memory allocator, which is very fast. You have to synchronise your queue (ring buffer) with a mutex.
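
A tiny sketch of the free-list idea for fixed-size blocks; it is single-threaded and uses made-up names, and a real multithreaded allocator would need per-thread lists or its own synchronisation:

#include <stdlib.h>

struct free_block { struct free_block *next; };

static struct free_block *free_list = NULL;
#define BLOCK_SIZE 64                   /* assumed fixed block size (>= sizeof(struct free_block)) */

void *block_alloc(void) {
    if (free_list != NULL) {            /* reuse a previously freed block */
        void *p = free_list;
        free_list = free_list->next;
        return p;
    }
    return malloc(BLOCK_SIZE);          /* otherwise fall back to malloc */
}

void block_free(void *p) {
    struct free_block *b = p;           /* push the block onto the free list */
    b->next = free_list;
    free_list = b;
}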

To avoid too much synchronisation, you can try writing/reading multiple values to/from the queue on each access.

If you still want to use lock-free algorithms, then you must use pre-allocated data or use a lock-free allocator. There is a paper about a lock-free allocator, "Scalable Lock-Free Dynamic Memory Allocation", and an implementation, Streamflow.

Before starting with lock-free stuff, look at: Circular lock-free buffer

bill
+1  A: 

Yes.

There exist a number of lock-free multiple-reader multiple-writer queues.

I have implemented one, by Michael and Scott, from their 1996 paper.

I will (after some more testing) be releasing a small library of lock-free data structures (in C) which will include this queue.

Blank Xavier
1. These use malloc'd nodes, which tends to kill performance. 2. That algorithm uses CAS; a CAS puts a LOCK on the memory and hence is inferior to the above. In fact, in cases where locks are rarely held (e.g. quick locks), a CAS == SpinLock on multi-core. Would like to see it though.
ben
The OP asks for malloc. The library is here: http://www.liblfds.org
Blank Xavier
+1  A: 

Adding malloc would kill any performance gain you may make, and a lock-based structure would be just as effective. This is because malloc requires some sort of CAS lock over the heap; hence some forms of malloc have their own lock, so you may end up locking in the memory manager.

To use malloc you would need to pre-allocate all the nodes and manage them with another queue...

Note that you can make some form of expandable array, which would need to lock when it is expanded.

Also, while interlocked operations are lock-free on the CPU, they do place a memory lock and block memory for the duration of the instruction, and they often stall the pipeline.

ben
+1  A: 

You should look at the FastFlow library.

Yogesh Arora