I was reading through an answer that Jon Skeet gave to a question and in it he mentioned this:

As far as I'm concerned, lock-free multi-threading is for real threading experts, of which I'm not one.

It's not the first time I have heard this, but I find very few people talking about how you actually do it if you are interested in learning to write lock-free multi-threading code.

So my question is: besides learning all you can about threading in general, where do you start trying to learn to write lock-free multi-threading code specifically, and what are some good resources?

Cheers

A: 

When it comes to multi-threading, you have to know exactly what you are doing. I mean, explore all the possible scenarios/cases that might occur when you are working in a multi-threaded environment. Lock-free multithreading is not a library or a class that we incorporate; it's knowledge and experience that we earn during our journey with threads.

Bragboy
There are numerous libraries that provide lock-free threading semantics. STM is of particular interest, of which there are quite a number of implementations around.
Marcelo Cantos
I see both sides of this one. Getting effective performance out of a lock-free library requires deep knowledge of memory models. But a programmer who doesn't have that knowledge can still benefit from the correctness advantages.
Ben Voigt
+4  A: 

Google for lock free data structures and software transactional memory.

I'll agree with Jon Skeet on this one; lock-free threading is the devil's playground, and best left up to people who know that they know what they need to know.

Marcelo Cantos
+8  A: 

There is no such thing as "lock-free threading" these days. It was an interesting playground for academia and the like, back at the end of the last century, when computer hardware was slow and expensive. Dekker's algorithm was always my favorite, but modern hardware has put it out to pasture. It doesn't work anymore.

Two developments ended this: the growing disparity between the speed of RAM and the CPU, and the ability of chip manufacturers to put more than one CPU core on a chip.

The RAM speed problem required chip designers to put a buffer on the CPU chip. The buffer stores code and data, quickly accessible by the CPU core, and is read from and written to RAM at a much slower rate. This buffer is called the CPU cache; most CPUs have at least two of them. The 1st-level cache is small and fast, the 2nd is big and slower. As long as the CPU can read data and instructions from the 1st-level cache, it will run fast. A cache miss is really expensive: it stalls the CPU for as many as 10 cycles if the data is not in the 1st-level cache, and for as many as 200 cycles if it isn't in the 2nd-level cache either and needs to be read from RAM.

Every CPU core has its own cache, storing its own "view" of RAM. When the CPU writes data, the write goes to the cache, which is then, slowly, flushed to RAM. Inevitably, each core will now have a different view of the RAM contents. In other words, one CPU doesn't know what another CPU has written until that RAM write cycle has completed and the CPU refreshes its own view.

That is dramatically incompatible with threading. You always really care about the state of another thread when you must read data that was written by it. To ensure this, you need to explicitly program a so-called memory barrier. It is a low-level CPU primitive that ensures all CPU caches are in a consistent state and have an up-to-date view of RAM: all pending writes have to be flushed to RAM, and the caches then need to be refreshed.

This is available in .NET: the Thread.MemoryBarrier() method implements one. Given that this is 90% of the job that the lock statement does (and 95+% of its execution time), you are simply not ahead by avoiding the tools that .NET gives you and trying to implement your own.
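For illustration only, here is a minimal C# sketch of publishing data from one thread to another with explicit barriers (the class and member names are made up):

    using System.Threading;

    class BarrierExample
    {
        private int _data;
        private bool _ready;

        // Writer thread: publish the data, then raise the flag.
        public void Publish(int value)
        {
            _data = value;
            Thread.MemoryBarrier(); // the write to _data must become visible before _ready
            _ready = true;
        }

        // Reader thread: only trust _data after seeing the flag.
        public bool TryRead(out int value)
        {
            value = 0;
            if (!_ready)
                return false;
            Thread.MemoryBarrier(); // don't let the read of _data float up before the _ready check
            value = _data;
            return true;
        }
    }

A lock statement gives you these barriers implicitly, which is why the advice above is to stick with the tools .NET already provides.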

Hans Passant
I'm confused by your answer. Cache synchronization and contention for resources are only some of the penalties of OS-provided locking. What about the cost of rescheduling? I'm not sure what your bottom-line point is here.
Sam Post
Not sure how rescheduling plays a role at all. The point of threading is to have a lock *not* block the thread 99.9% of the time. If it is a lot less, then you don't get any value for the money. If it does block, there isn't anything you can do but wait. The lock statement already implements a spinwait.
Hans Passant
Right, I get that. I guess my question is: why does "cache synchronization is slow" === "lock-free is no better than locking"? I am new to lock-free and like the sound of it, but you seem to think it's not super useful.
Sam Post
Maybe you could explain why you are considering "lock-free threading". What do you hope to gain? If you think you can somehow make it more efficient, then I failed to explain what the problem with the CPU cache is all about.
Hans Passant
What about the benefit to consuming code of eliminating the possibility of deadlocks and reducing opportunities for threading-related coding errors (once the difficult part of setting up such a framework is in place)? I'm less interested in the performance aspect (as long as it's not significantly worse than locking) than I am in the simplification of writing correct multi-threaded code. If done correctly, it should boost the ability of the average coder to write correct multithreaded code, much as garbage collection reduces the complexities of memory management.
Davy8
@Davy8: composition still makes it hard. If I have two lock-free hash tables and as a consumer I access both of them, this will not guarantee consistency of state as a whole. The closest you can come today is STMs, where you can put the two accesses e.g. in a single `atomic` block. All in all, consuming lock-free structures can be just as tricky in many cases.
andras
What definition are you using for lock-free threading? Are Erlang's message passing or software transactional memory included, or not?
Rob Lachlan
@Rob: I believe that there is no magic going on here: STMs still utilize low-lock ("lock-free") techniques, such as lock striping or CAS+retry. I do not know much about Erlang internals, but message queues in shared memory face the same problems. You have shared state - now it is the message queue that you have to lock or implement with a "lock-free" structure. It is only the granularity and scope of the "lock" that differs. However, using MPI or an STM can make a difference in terms of performance/usability - compared especially to an average, half-decent locking solution.
andras
Message-passing systems generally scale far better than anything that uses locks. They have to be designed that way from the ground up, but they're generally easier to reason about (i.e. formal correctness proofs) and good use of partitioning usually makes the complexity of any single component lower as well.
Ben Voigt
@Ben: I sincerely agree - what I wanted to point out is that as far as I know, you have to have some shared state somewhere in the system. True, it can be minimal in case of an MPI system. As far as I know, Erlang is built from the ground up with MPI in mind. It still uses locking at the lowest level. It turns out that it used one big fat lock for a global process queue. This created scalability problems, however. Now they plan to use multiple locks instead of one: http://stackoverflow.com/questions/605183/how-if-at-all-do-erlang-processes-map-to-kernel-threads
andras
I may be wrong, but I think you've mis-explained how cache coherency works. Most modern multicore processors have coherent caches, which means that the cache hardware handles making sure that all processes have the same view of RAM contents -- by blocking "read" calls until all corresponding "write" calls have completed. The Thread.MemoryBarrier() documentation (http://msdn.microsoft.com/en-us/library/system.threading.thread.memorybarrier.aspx) says nothing about cache behavior at all -- it's simply a directive that prevents the processor from reordering reads and writes.
Brooks Moses
"There is no such thing as "lock-free threading" these days." Tell that to the Erlang and Haskell programmers.
Juliet
@Brooks: modern x86 CPUs haven't maintained coherent caches for quite some time. The latest offerings from both Intel and AMD are highly non-uniform and trying to maintain coherency causes a severe performance hit. Instead cache coherency protocols are used to re-establish coherency at specific points during execution, and minimizing these is important to performance. Also, the register file is equivalent to another level of cache and it isn't even visible to cache coherency protocols. Ensuring that computations get flushed from registers to (cache) memory in the right order is critical.
Ben Voigt
@Ben: Thanks for the correction. Do you have any suggestions for good references where I can read up on this a bit more?
Brooks Moses
@Brooks: The wikipedia article http://en.wikipedia.org/wiki/MOESI_protocol has a basic discussion along with a bunch of links to other references, including the AMD engineering documents. Note that the mere existence of the "invalid" state implies that the cache isn't coherent. A coherent cache system (write-through) would snoop the write from the other processor and update the local cache; it wouldn't ever become invalid.
Ben Voigt
@Ben: Thanks for the link. I think we must simply be talking about rather different meanings of the phrase "coherent cache" -- when the first line of the description of a caching protocol is, "This is a full cache coherency protocol...," I tend to think of it as coherent. :) Other than that difference of terminology, though, I was talking about MOESI-protocol mechanics, so we're on the same page there.
Brooks Moses
+10  A: 

Joe Duffy's book:

http://www.bluebytesoftware.com/books/winconc/winconc_book_resources.html

He also writes a blog on these topics.

The trick to getting low-lock programs right is to understand at a deep level precisely what the rules of the memory model are on your particular combination of hardware, operating system, and runtime environment.

I personally am not anywhere near smart enough to do correct low-lock programming beyond InterlockedIncrement, but if you are, great, go for it. Just make sure that you leave lots of documentation in the code so that people who are not as smart as you don't accidentally break one of your memory model invariants and introduce an impossible-to-find bug.
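For reference, the one primitive mentioned above corresponds to the Interlocked class in .NET; a trivial, illustrative sketch (the counter class is made up):

    using System.Threading;

    class SafeCounter
    {
        private int _count;

        // Atomically increments the shared counter and returns the new value;
        // correct under any number of concurrent callers, with no lock taken.
        public int Increment()
        {
            return Interlocked.Increment(ref _count);
        }
    }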

Eric Lippert
+18  A: 

Current "lock-free" implementations follow the same pattern most of the time:

  • read some state and make a copy of it*
  • modify copy*
  • do an interlocked operation
  • retry if it fails

(*optional: depends on the data structure/algorithm)
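A minimal C# sketch of this read-modify-CAS-retry loop, for illustration (the counter and the doubling update are made up):

    using System.Threading;

    static class CasRetryExample
    {
        private static int _value;

        public static int DoubleValue()
        {
            while (true)
            {
                int snapshot = _value;        // 1. read some state (the local is our "copy")
                int updated = snapshot * 2;   // 2. modify the copy
                // 3. interlocked operation: publish only if _value is still what we read
                if (Interlocked.CompareExchange(ref _value, updated, snapshot) == snapshot)
                    return updated;
                // 4. somebody else got in between - retry
            }
        }
    }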

The last bit is eerily similar to a spinlock. In fact, it is a basic spinlock. :)
I agree with @nobugz on this: the cost of the interlocked operations used in lock-free multi-threading is dominated by the cache and memory-coherency tasks they must carry out.

What you gain, however, with a data structure that is "lock-free" is that your "locks" are very fine-grained. This decreases the chance that two concurrent threads access the same "lock" (memory location).

The trick most of the time is that you do not have dedicated locks - instead, you treat e.g. all elements in an array or all nodes in a linked list as a "spin-lock". You read, modify, and try to update if there was no update since your last read. If there was, you retry.
This makes your "locking" (oh, sorry, non-locking :) very fine-grained, without introducing additional memory or resource requirements.
Making it more fine-grained decreases the probability of waits. Making it as fine-grained as possible without introducing additional resource requirements sounds great, doesn't it? (See the sketch below.)
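Here is a sketch of that idea in C#, treating every array slot as its own tiny "lock" (the AddTo helper is made up for illustration):

    using System.Threading;

    static class FineGrainedArray
    {
        private static readonly int[] _slots = new int[1024];

        // Two threads only retry against each other when they target the
        // same index; updates to different slots do not conflict logically.
        public static void AddTo(int index, int delta)
        {
            while (true)
            {
                int seen = _slots[index];
                if (Interlocked.CompareExchange(ref _slots[index], seen + delta, seen) == seen)
                    return; // no update since our read - success
                // another thread touched this slot in the meantime - retry
            }
        }
    }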

Most of the fun, however, comes from ensuring correct load/store ordering.
Contrary to one's intuition, CPUs are free to reorder memory reads and writes - they are very smart about it, by the way: you will have a hard time observing this from a single thread. You will, however, run into issues when you start doing multi-threading on multiple cores. Your intuition will break down: just because an instruction comes earlier in your code does not mean it will actually happen earlier. CPUs can process instructions out of order, and they especially like to do this to instructions with memory accesses, to hide main-memory latency and make better use of their cache.

Now, it certainly goes against intuition that a sequence of code does not flow "top-down"; instead, it runs as if there were no sequence at all - which might be called the "devil's playground". I believe it is infeasible to give an exact answer as to which load/store re-orderings will take place. Instead, one always speaks in terms of mays and mights and cans, and prepares for the worst: "Oh, the CPU might reorder this read to come before that write, so it is best to put a memory barrier right here, on this spot."
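The classic litmus test for this kind of reordering, sketched in C# for illustration (a single run will rarely show it, so in practice you would loop many times; the field names are made up):

    using System.Threading;

    class ReorderingLitmus
    {
        private int x, y, r1, r2;

        public void RunOnce()
        {
            x = y = 0;
            var t1 = new Thread(() => { x = 1; r1 = y; });
            var t2 = new Thread(() => { y = 1; r2 = x; });
            t1.Start(); t2.Start();
            t1.Join(); t2.Join();
            // Intuition says at least one of r1/r2 must be 1, since one of the
            // stores "happens first". Yet without barriers, r1 == 0 && r2 == 0
            // is a legal outcome: each CPU may let its load overtake its own
            // earlier store.
        }
    }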

Matters are complicated by the fact that even these mays and mights differ across CPU architectures. It might be the case, for example, that something that is guaranteed not to happen on one architecture might happen on another.


To get "lock-free" multi-threading right, you have to understand memory models.
Getting the memory model and its guarantees correct is not trivial, however, as demonstrated by this story, in which Intel and AMD made some corrections to the documentation of MFENCE, causing a stir among JVM developers. As it turned out, the documentation that developers had relied on from the beginning was not so precise in the first place.

Locks in .NET result in an implicit memory barrier, so you are safe using them (most of the time, that is... see for example this Joe Duffy - Brad Abrams - Vance Morrison greatness on lazy initialization, locks, volatiles and memory barriers. :) (Be sure to follow the links on that page.)

As an added bonus, you will get introduced to the .NET memory model on a side quest. :)
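The lazy-initialization pattern those links dissect is the classic double-checked lock; its usual C# form is shown below for reference (the class name is made up):

    class Expensive
    {
        private static readonly object _gate = new object();
        private static volatile Expensive _instance; // volatile: safe publication of the reference

        public static Expensive Instance
        {
            get
            {
                if (_instance == null)      // fast path: no lock once initialized
                {
                    lock (_gate)            // the lock supplies the implicit barriers
                    {
                        if (_instance == null)
                            _instance = new Expensive();
                    }
                }
                return _instance;
            }
        }
    }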

There is also an "oldie but goldie" from Vance Morrison: What Every Dev Must Know About Multithreaded Apps.

...and of course, as @Eric mentioned, Joe Duffy is a definitive read on the subject.

A good STM can get as close to fine-grained locking as it gets and will probably provide performance that is close to or on par with a hand-made implementation. One such STM is STM.NET from Microsoft's DevLabs projects.

If you are not a .NET-only zealot, Doug Lea did some great work in JSR-166.
Cliff Click has an interesting take on hash tables that does not rely on lock-striping - as the Java and .NET concurrent hash tables do - and seems to scale well to 750 CPUs.

If you are not afraid to venture into Linux territory, the following article provides more insight into the internals of current memory architectures and how cache-line sharing can destroy performance: What every programmer should know about memory.
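To make the cache-line sharing point concrete, here is a toy C# benchmark (assuming 64-byte cache lines; the names and iteration count are illustrative, and exact timings will vary by machine):

    using System;
    using System.Diagnostics;
    using System.Threading;

    static class FalseSharingDemo
    {
        const int Iterations = 100000000;

        static long Time(int[] data, int indexA, int indexB)
        {
            var sw = Stopwatch.StartNew();
            var t1 = new Thread(() => { for (int i = 0; i < Iterations; i++) data[indexA]++; });
            var t2 = new Thread(() => { for (int i = 0; i < Iterations; i++) data[indexB]++; });
            t1.Start(); t2.Start();
            t1.Join(); t2.Join();
            return sw.ElapsedMilliseconds;
        }

        static void Main()
        {
            var buffer = new int[64];
            // Adjacent counters almost certainly share a cache line:
            // the two cores keep stealing the line from each other.
            Console.WriteLine("adjacent: " + Time(buffer, 0, 1) + " ms");
            // 16 ints (64 bytes) apart: very likely on separate lines.
            Console.WriteLine("padded:   " + Time(buffer, 0, 16) + " ms");
        }
    }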

@Ben made many comments about MPI: I sincerely agree that MPI may shine in some areas. An MPI-based solution can be easier to reason about, easier to implement, and less error-prone than a half-baked locking implementation that tries to be smart. (This is, however - subjectively - also true for an STM-based solution.) I would also bet that it is light-years easier to correctly write a decent distributed application in e.g. Erlang, as many successful examples suggest.

MPI, however, has its own costs and its own troubles when it runs on a single, multi-core system. E.g. in Erlang, there are issues to be solved around the synchronization of process scheduling and message queues.
Also, at their core, MPI systems usually implement a kind of cooperative N:M scheduling for "lightweight processes". This means, for example, that there is an inevitable context switch between lightweight processes. True, it is not a "classic" context switch but mostly a user-space operation, and it can be made fast - but I sincerely doubt it can be brought under the 20-200 cycles an interlocked operation takes. User-mode context switching is certainly slower, even in the Intel McRT library. N:M scheduling with lightweight processes is not new. LWPs were in Solaris for a long time; they were abandoned. There were fibers in NT; they are mostly a relic now. There were "activations" in NetBSD; they were abandoned. Linux had its own take on N:M threading; it seems to be somewhat dead by now.
From time to time, there are new contenders: for example, McRT from Intel, or most recently User-Mode Scheduling together with ConCRT from Microsoft.
At the lowest level, they do what an N:M MPI scheduler does. Erlang - or any MPI system - might benefit greatly on SMP systems by exploiting the new UMS.

I guess the OP's question is not about the merits of, and subjective arguments for/against, any particular solution, but if I had to answer it, I guess it depends on the task: for building low-level, high-performance basic data structures that run on a single system with many cores, either low-lock/"lock-free" techniques or an STM will yield the best results in terms of performance and would probably beat an MPI solution any time performance-wise, even if the above wrinkles are ironed out, e.g. in Erlang.
For building anything moderately more complex that runs on a single system, I would perhaps choose classic coarse-grained locking or, if performance is of great concern, an STM.
For building a distributed system, an MPI system would probably be the natural choice.
Note that there are MPI implementations for .NET as well (though they seem to be less active).

andras
That's a typical way of converting code that relies on locks to be "lock-free". It's not at all typical of code designed from the ground up to avoid locks, which frequently uses some form of producer/consumer queues for message passing.
Ben Voigt
@Ben: That's an illusion, if you look deep - these systems use their own internal structures for message passing and resource scheduling. These structures use locking or the above typical "lock-free" techniques. As a good example, Erlang is built from the ground up with MPI in mind. It still uses locking at the lowest level. It turns out that it used one big fat lock for a global process queue. This created scalability problems, however. Now they plan to use multiple locks instead of one: http://stackoverflow.com/questions/605183/how-if-at-all-do-erlang-processes-map-to-kernel-threads
andras
@Ben: ...so in the end, someone, somewhere, sometime will have to write a low-lock/"lock-free" queue implementation for Erlang - running into the same problems as above...;)
andras
@andras: But only queues which have more than one writer require a lock (whether software lock, or hardware bus lock during an interlocked operation as you mentioned). Yes, there's some shared state written by multiple producers in any interesting system, but that can be made wait free (producer makes the receiver runnable, doesn't matter what the prior state was, doesn't matter whether someone else makes the receiver runnable as well). But more importantly, shared state is minimized in a system designed for message passing, so you don't pay for the cache sync very often at all.
Ben Voigt
I guess what I'm saying is that some of the steps are the same. But it's possible to get rid of the (make a copy) and the (retry) steps with a lock-free wait-free message queue and those are the bulk of the costs of parallelism.
Ben Voigt
@Ben: I think I see what you mean. You can in fact do something very similar with C++ on Windows 7, have a look at ConCRT: http://msdn.microsoft.com/en-us/library/dd998048(VS.100).aspx. It uses cooperative scheduling of lightweight tasks. It closely resembles MPI at the lowest levels. N:M scheduling with lightweight tasks was done before, it is not new. It has its own set of problems, however. http://en.wikipedia.org/wiki/Thread_(computer_science)#N:M_.28Hybrid_threading.29.
andras
@andras: I didn't claim I had production-grade code for that. Note that feasibility and availability of production-ready code are very different things. Since my work is on medical devices, I'm much more concerned with correctness such as the deadlock-free characteristics of lock-free code than the performance, but I know some of the theory as well. A multiple-producer single-consumer queue can be built from a set of parallel single-producer queues and a shared event (to set an event from multiple threads requires no retry). If connectivity is sparse, naive polling may perform acceptably.
Ben Voigt
(continued) If many producers must feed a single consumer, polling can be accomplished in log or log-log time with the use of flags for entire groups of queues. This doesn't scale very well in memory usage, but there's always some form of tradeoff. Since many systems can be designed with sparse connectivity, I understand there's a lot of applicability to this sort of queue implementation even if it isn't a universal solution.
Ben Voigt
I guess I'm not explaining myself properly. One single scheduler-integrated event wakes the consumer. The consumer then has to discover which of the incoming queues contains data. For sparse connectivity, an exhaustive search (in linear time) is ok. That's what I called "polling". But the polling only happens when at least one queue has data, so it's not continually burning CPU. For larger numbers of incoming queues, flags indicate the potential presence of data in a group of queues, allowing the search to take place in log time. And groups of groups give even better scalability.
Ben Voigt
@Ben: Be warned. It's only a braindump; there can be severe misunderstandings on my part about your ideas. :) When and how is the wake-up event reset? What kind of event will it be? E.g. the consumer wakes up, runs through the queues, finds all items that are queued, and consumes them. The problem I see is that it cannot stop, since the event stays signaled - producers only set it - and the consumer does not see all the queues "at once" to know when to reset it and assert: "now I have dequeued all items". New items could have arrived meanwhile and their signal would be lost.
andras
@Ben: using a semaphore with a count would alleviate this, I think. But then again: the whole "wait-free"-ness is not proper in the first place, imho. Taking your sentence: "to set an event from multiple threads requires no retry". True. It might, however, effectively serialize access to a shared memory location - so instead of a linear speedup and nice parallelization, your performance will be closer to the uniprocessor figure. (Meanwhile, I have managed to find a queue implementation: http://www.thedelphigeek.com/2008/07/lock-free-queue-finally.html. So please accept my apologies on that. :)
andras
@andras: I'm assuming an event which is reset atomically with wake. Maybe that isn't compatible with the wait-free requirement, but I think it can be done using InterlockedExchange (not InterlockedCompareExchange, which could fail and need retry).
Ben Voigt
@Ben: Sadly, no matter how one twists it, in reality even InterlockedExchange will not be compatible with a strict wait-freeness guarantee. It will not scale if multiple threads from multiple cores write to the same memory location. It waits. If you think about it, that is natural: after all, *you really do want to serialize access*. That is the most fundamental reason why, e.g., any rw-lock that uses a single memory location will not scale for fine-grained locking, even if you have 0 writers, no matter what cunning `lock xor`, `lock or`, `lock and`, etc. operations it employs.
andras
@Ben: e.g. http://www.bluebytesoftware.com/blog/2009/02/12/ReaderwriterLocksAndTheirLackOfApplicabilityToFinegrainedSynchronization.aspx
andras
@Ben: in fact, it is written right there in the assembly code: `"lock"`. :-) Now how the heck is someone supposed to build a lock-free, wait-free data structure out of that? :) (To preempt debates: `xchg` == `lock xchg`. It implies a lock, whether you write it down or not.)
andras