views:

2618

answers:

10

I'm in the process of designing a system which connects to one or more stream of data feeds and do some analysis on the data than trigger events based on the result. In a typical multi-threaded producer/consumer setup, i will have multiple producer threads putting data into a queue, and multiple consumer threads reading the data, and the consumers are only interested in the latest data point plus n number of points. The producer threads will have to block if slow consumer can not keep up, and of course consumer threads will block when there are no unprocessed updates. Using a typical concurrent queue with reader/writer lock will work nicely but the rate of data coming in could be huge, so i wanted to reduce my locking overhead especially writer locks for the producers. I think a circular lock-free buffer is what i needed.

Now two questions: 1. Is circular lock-free buffer the answer? 2. If so, before i roll my own, do you know any public implementation that will fit my need?

Any pointers in implementing a circular lock-free buffer are always welcome.

BTW, doing this in C++ on Linux.

Some additional info:

The response time is critical for my system. Ideally the consumer threads will want to see any updates coming in as soon as possible because an extra 1 millisecond delay could make the system worthless, or worth a lot less.

The design idea i'm leaning toward is a simi-lock-free circular buffer where the producer thread put data in the buffer as fast as it can, let's call the head of the buffer A, without blocking unless the buffer is full, when A meets the end of buffer Z. Consumer threads will each hold two pointers to the circular buffer, P and Pn, where P is the thread's local buffer head, and Pn is nth item after P. Each consumer thread will advance its P and Pn once it finish processing current P and the end of buffer pointer Z is advanced with the slowest Pn. When P catch up to A, which means no more new update to process, the consumer spins and do busy wait for A to advance again. If consumer thread spin for too long, it can be put to sleep and wait for a condition variable, but i'm okay with consumer taking up CPU cycle waiting for update because that does not increase my latency (i'll have more CPU cores than threads). Imagine you have a circular track, and the producer is running in front of a bunch of consumers, the key is to tune the system so that the producer is usually runing just a few step ahead of the consumers, and most of these operation can be done using lock-free techniques. I understand getting the details of the implementation right is not easy...okay, very hard, that's why i want to learn from other's mistakes before making a few of my own.

+4  A: 

There is a pretty good series of articles about this on DDJ. As a sign of how difficult this stuff can be, it's a correction on an earlier article that got it wrong. Make sure you understand the mistakes before you roll your own )-;

Henk Holterman
+7  A: 

The requirement that producers or consumers block if the buffer is empty or full suggests that you should use a normal locking data structure, with semaphores or condition variables to make the producers and consumers block until data is available. Lock-free code generally doesn't block on such conditions - it spins or abandons operations that can't be done instead of blocking using the OS. (If you can afford to wait until another thread produces or consumes data, then why is waiting on a lock for another thread to finish updating the data structure any worse?)

On (x86/x64) Linux, intra-thread synchronization using mutexes is reasonably cheap if there is no contention. Concentrate on minimizing the time that the producers and consumers need to hold onto their locks. Given that you've said that you only care about the last N recorded data points, I think a circular buffer would be do this reasonably well. However, I don't really understand how this fits in with the blocking requirement and the idea of consumers actually consuming (removing) the data they read. (Do you want consumers to only look at the last N data points, and not remove them? Do you want producers to not care if consumers can't keep up, and just overwrite old data?)

Also, as Zan Lynx commented, you can aggregate/buffer up your data into bigger chunks when you've got lots of it coming in. You could buffer up a fixed number of points, or all the data received within a certain amount of time. This means that there will be fewer synchronization operations. It does introduce latency, though, but if you're not using real-time Linux, then you'll have to deal with that to an extent anyway.

Doug
+7  A: 

The term of art for what you want is a lock-free queue. There's an excellent set of notes with links to code and papers by Ross Bencina. The guy whose work I trust the most is Maurice Herlihy (for Americans, he pronounces his first name like "Morris").

Norman Ramsey
A queue is not a circular buffer.
Blank Xavier
@Blank Xavier: No, but a circular buffer is a queue. The problem calls for a queue. And the most efficient implementation of a queue is as... (wait for it) a circular buffer. In any case, if you want to search, you search for 'lock-free queue', not 'lock-free circular buffer'.
Norman Ramsey
+1  A: 

One useful technique to reduce contention is to hash the items into multiple queues and have each consumer dedicated to a "topic".

For most-recent number of items your consumers are interested in - you don't want to lock the whole queue and iterate over it to find an item to override - just publish items in N-tuples, i.e. all N recent items. Bonus points for implementation where producer would block on the full queue (when consumers can't keep up) with a timeout, updating its local tuple cache - that way you don't put back-pressure on the data source.

Nikolai N Fetissov
I have also considered the boss/worker threading model where the boss thread multicast updates to worker threads' private queues. I think this is more/less the direction you're going. I have to give it more though, but when i was considering it, boss/worker seemed to have too much overhead because all worker must get the same updates.
Shing Yip
Not exactly - what I mean in the first point is slice your incoming stream so not all the threads compete for the same lock/queue.The second point is caching on the producer side to accommodate spikes on the input and also allow for slow consumers not to halt the producer.
Nikolai N Fetissov
But business logic requires all worker thread to know all data that is streaming in. There is one and only one type of data coming in and every data point is equally important, so i can't really slice my incoming stream and have different data in different queues. Cashing on the producer side and bundling updates to the data model to prevent hogging has been done and it wasn't enough to handle the load.
Shing Yip
How big is the input domain? If it's something like market data in financial world, you have limited, though big, number of items and only few types of updates. Are workers reactive to the input events or do they do their own processing and only poll for your input when necessary?
Nikolai N Fetissov
It is something like market data in the financial world. Workers do their own processing and they have random access to n numbers of update history when needed (n is a configurable number but will not change during the lifetime of the process). I would like to design a system that works well on both large and small n so I can have one code base.
Shing Yip
Well, you can hand off n-tuples to consumers with swap of a pointer to it (you'd need to worry about memory fences, etc. - ordered atomic) but then you'll get into memory management issues like with hazard pointers.
Nikolai N Fetissov
+1  A: 

http://dream.eng.uci.edu/eecs123/2009sp_3_NBB.pdf

My professor's lecture slideshow regarding lock-free / non-blocking buffers.

switchmode
Is your Prof colour blind? do his students all wear sunglasses?
Blank Xavier
+3  A: 

I would agree with this article and recommend against using lock-free data structures. A relatively recent paper on lock-free fifo queues is this, search for further papers by the same author(s); there's also a PhD thesis on Chalmers regarding lock-free data structures (I lost the link). However, you did not say how large your elements are -- lock-free data structures work efficiently only with word-sized items, so you'll have to dynamically allocate your elements if they're larger than a machine word (32 or 64 bits). If you dynamically allocate elements, you shift the (supposed, since you haven't profiled your program and you're basically doing premature optimization) bottleneck to memory allocator, so you need a lock-free memory allocator, e.g., Streamflow, and integrate it with your application.

zvrba
+3  A: 

Sutter's queue is sub-optimal and he knows it. The Art of Multicore programming is a great reference but don't trust the Java guys on memory models, period. Ross's links will get you no definite answer because they had their libraries in such problems and so on.

Doing lock-free programming is asking for trouble, unless you want to spend a lot of time on something that you are clearly over-engineering before solving the problem (judging by the description of it, it is a common madness of 'looking for perfection' in cache coherency). It takes years and leads to not solving the problems first and optimising later, a common disease.

rama-jka toti
Would you post a link to analysis of Sutter's queue?
Nikolai N Fetissov
it's all on DDJ and one of the guys following his blogs profiled it .. The point is that hot CAS isn't required for many scenarios and that you can beat that kind of fine granularity any day even with simple swaps.
rama-jka toti
Do you mean Dennis Lang? http://home.comcast.net/~lang.dennis/code/#ring
Nikolai N Fetissov
That's it, thanks.. But I believe it still might have some races. Heck, anything out there as sensitive as expecting implicit barriers or specific or full coherency-understanding is a problem waiting to happen in production. I just don't believe that level of detail solves so much as focus on application-level design rather than low-level plumbing when/only-if it makes sense/is-identified to be so. I applaud the effort, the books, all of it; but its just articles on a touch topic even MS is struggling to do well for the mass-market PFX crowd.
rama-jka toti
Just an opinion there is always more important work to do than looking into plumbing. Parallel efforts ripple across the board not just queues, or indeed mid-1990s threading DDJ articles keep reinventing; that is, from NT to later Solaris and Unix adopting similar techniques or latest work on C++. The latter is and probably will take ages to complete and still fight the fact no clean OO-way to post P2-Pro-like out-of-order universe is sensible..
rama-jka toti
+5  A: 

I've made a particular study of lock-free data structures in the last couple of years. I've read most of the papers in the field (there's only about fourty or so - although only about ten or fifteen are any real use :-)

AFAIK, a lock-free circular buffer has not been invented. The problem will be dealing with the complex condition where a reader overtakes a writer or vis-versa.

If you have not spent at least six months studying lock-free data structures, do not attempt to write one yourself. You will get it wrong and it may not be obvious to you that errors exist, until your code fails, after deployment, on new platforms.

I believe however there is a solution to your requirement.

You should pair a lock-free queue with a lock-free free-list.

The free-list will give you pre-allocation and so obviate the (fiscally expensive) requirement for a lock-free allocator; when the free-list is empty, you replicate the behaviour of a circular buffer by instantly dequeuing an element from the queue and using that instead.

(Of course, in a lock-based circular buffer, once the lock is obtained, obtaining an element is very quick - basically just a pointer dereference - but you won't get that in any lock-free algorithm; they often have to go well out of their way to do things; the overhead of failing a free-list pop followed by a dequeue is on a par with the amount of work any lock-free algorithm will need to be doing).

Michael and Scott developed a really good lock-free queue back in 1996. A link below will give you enough details to track down the PDF of their paper; Michael and Scott, FIFO

A lock-free free-list is the simplest lock-free algorithm and in fact I don't think I've seen an actual paper for it.

Blank Xavier
BTW, I implemented this. http://www.liblfds.org
Blank Xavier
A: 

Here is how I would do it:

  • map the queue into an array
  • keep state with a next read and next next write indexes
  • keep an empty full bit vector around

Insertion consists of using a CAS with an increment and roll over on the next write. Once you have a a slot, add your value and then set the empty/full bit that matches it.

Removals require a check of the bit before to test on underflows but other than that, are the same as for the write but using read index and clearing the empty/full bit.

Be warned,

  1. I'm no expert in these things
  2. atomic ASM ops seem to be very slow when I've used them so if you end up with more than a few of them, you might be faster to use locks embedded inside the insert/remove functions. The theory is that a single atomic op to grab the lock followed by (very) few non atomic ASM ops might be faster than the same thing done by several atomic ops. But to make this work would require manual or automatic inlineing so it's all one short block of ASM.
BCS
Atomic operations are indeed slow, in and of themselves. What makes them useful is that they scale.
Blank Xavier
If the operations inside the lock are very small (as in 5-10 lines of ASM) you might still be ahead to use a locking strategy, particularly if you write the lock directly into the critical sections rather than as a function call.
BCS
I'm confused. A critical section to me *is* the section of code which must be serially executed. The lock is the mechanism which ensures serialiality of execution. Could you explain to me what you mean?
Blank Xavier
That is what it is to me as well. See edit.
BCS
A: 

Just for completeness: there's well tested lock-free circular buffer in OtlContainers, but it is written in Delphi (TOmniBaseBoundedQueue is circular buffer and TOmniBaseBoundedStack is bounded stack). There's also an unbounded queue in the same unit (TOmniBaseQueue). The unbounded queue is described in Dynamic lock-free queue – doing it right. The initial implementation of the bounded queue (circular buffer) was described in A lock-free queue, finally! but the code was updated since then.

gabr