Summary:

It seems to me that:

  1. wrapping fields representing a logical state into a single immutable consumable object
  2. updating the object's authoritative reference with a call to Interlocked.CompareExchange<T>
  3. and handling update failures appropriately

provides a kind of concurrency that renders the "lock" construct not only unnecessary, but truly misleading: it dodges certain realities about concurrency and introduces a host of new problems as a result.

Problem Discussion:

First, let's consider the main problems with using a lock:

  1. Locks incur a performance hit, and must be used consistently for both reading and writing.
  2. Locks block thread execution, hindering concurrency and risking deadlocks.

Consider the ridiculous behavior inspired by the "lock". When the need arises to update a logical set of resources concurrently, we "lock" the set of resources, and we do so via a loosely associated, but dedicated locking object, which otherwise serves no purpose (red flag #1).

We then use the "lock" pattern to mark off a region of code where a logically consistent state change occurs on a SET of data fields, and yet we shoot ourselves in the foot by mixing those fields with unrelated fields in the same object, leaving them all mutable, and then forcing ourselves into a corner (red flag #2) where we must also take locks when reading those various fields, so we don't catch them in an inconsistent state.

Clearly, there's a serious problem with that design. It's somewhat unstable, because it requires careful management of the lock objects (locking order, nested locks, coordination among threads, blocking/waiting on a resource in use by another thread that's waiting for you to do something, etc.), which depends on the context. We also hear people talk about how avoiding deadlock is "hard", when it's actually very straightforward: don't steal the shoes of a person you plan on asking to run a race for you!

Solution:

Stop using "lock" altogether. Properly roll your fields into an incorruptible/immutable object representing a consistent state or schema. Perhaps it's simply a pair of dictionaries for converting to and from display names and internal identifiers, or maybe it's the head node of a queue containing a value and a link to the next object; whatever it is, wrap it into its own object and seal it for consistency.
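For illustration, here is a minimal sketch of such a sealed, immutable state object (the class name and fields are hypothetical; substitute your own logical state):

    using System.Collections.Generic;

    // A sealed, immutable snapshot: two lookup dictionaries that must
    // always stay consistent with each other. "Updates" produce a brand
    // new snapshot; existing ones are never modified.
    public sealed class NameMapState
    {
        private readonly Dictionary<string, int> idsByName;
        private readonly Dictionary<int, string> namesById;

        public NameMapState(IDictionary<string, int> source)
        {
            // Copy the input so no outside reference can mutate this state.
            idsByName = new Dictionary<string, int>(source);
            namesById = new Dictionary<int, string>();
            foreach (var pair in idsByName)
                namesById[pair.Value] = pair.Key;
        }

        public int IdOf(string name) => idsByName[name];
        public string NameOf(int id) => namesById[id];

        public NameMapState WithEntry(string name, int id)
        {
            var copy = new Dictionary<string, int>(idsByName) { [name] = id };
            return new NameMapState(copy);
        }
    }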

Recognize write or update failure as a possibility, detect it when it occurs, and make a contextually informed decision to retry immediately (or later) or do something else instead of blocking indefinitely.
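As one hypothetical way of making that decision, here is a sketch of a bounded-retry helper: spin a few times, then report failure so the caller can defer the work rather than block (all names are illustrative):

    using System;
    using System.Threading;

    static class Optimistic
    {
        // Attempt an optimistic update a bounded number of times, then
        // report failure so the caller can retry later or do something else.
        public static bool TryPublish<T>(ref T location, Func<T, T> update,
                                         int maxAttempts) where T : class
        {
            var spinner = new SpinWait();
            for (int attempt = 0; attempt < maxAttempts; attempt++)
            {
                T seen = Volatile.Read(ref location);
                T replacement = update(seen);
                if (Interlocked.CompareExchange(ref location, replacement, seen) == seen)
                    return true;    // our snapshot was still current; published
                spinner.SpinOnce(); // lost the race; back off briefly and retry
            }
            return false;           // caller decides: queue it, retry later, etc.
        }
    }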

While blocking seems like a simple way to queue a task that seems like it must be done, not all threads are so dedicated and self-serving that they can afford to do such a thing at the risk of compromising the entire system. Not only is it lazy to serialize things with a "lock", but as a side effect of trying to pretend a write shouldn't fail, you block/freeze your thread, so it sits there unresponsive and useless, forsaking all other responsibilities in its stubborn wait to accomplish what it set out to do some time earlier, ignorant of the fact that assisting others is sometimes necessary for fulfilling its own responsibilities.

Race conditions are normal when independent, spontaneous actions are occurring simultaneously, but unlike uncontrolled Ethernet collisions, as programmers we have total control over our "system" (i.e. deterministic digital hardware), its inputs (no matter how random, and how random can a zero or one really be?), its outputs, and the memory that stores our system's state, so livelock should be a non-issue. Furthermore, we have atomic operations with memory barriers that resolve the fact that many processors may be operating concurrently.

To summarize:

  1. Grab the current state object, consume its data, and construct a new state.
  2. Realize that other active threads will be doing the very same thing, and may beat you to it, but all observe an authoritative reference point representing the "current" state.
  3. Use Interlocked.CompareExchange to atomically check whether the state object you based your work on is still the most current state and, if so, replace it with your new one; otherwise fail (because another thread finished first) and take appropriate corrective action.

The most important part is how you handle the failure and get back on your horse. This is where we avoid livelock: thinking too much and not doing enough, or not doing the right thing. I would say that locks create the illusion that you'll never fall off your horse, despite riding in a stampede, and while a thread daydreams in such a fantasy land, the rest of the system can fall apart and crash and burn.
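A minimal sketch of the whole loop, using a trivially small immutable state (hypothetical types; a real state object would carry your dictionaries, queue head, and so on):

    using System.Threading;

    sealed class CounterState
    {
        public readonly long Value;
        public CounterState(long value) { Value = value; }
    }

    static class StateHolder
    {
        static CounterState current = new CounterState(0);

        public static void Increment()
        {
            while (true)
            {
                // 1. Grab the current state and do your work against it.
                CounterState seen = Volatile.Read(ref current);
                CounterState next = new CounterState(seen.Value + 1);

                // 2-3. Publish only if nobody beat us to it; otherwise loop
                // and redo the (often cheap) work against the newer state.
                if (Interlocked.CompareExchange(ref current, next, seen) == seen)
                    return;
            }
        }
    }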


So, is there something the "lock" construct can do that can't be achieved (better, and in a less fragile fashion) with a lock-free implementation using CompareExchange and immutable logical state objects?

All of this is a realization I've come to on my own after dealing with locks intensely, but after some searching, in another thread http://stackoverflow.com/questions/1385468/is-lock-free-multithreaded-programming-making-anything-easier, someone mentions that lock-free programming is going to be very important when we face highly parallel systems with hundreds of processors, where we cannot afford to use highly contended locks.

+1  A: 

I'd like to know how you would perform this task using your lock-free programming style? You have a number of worker threads all periodically hitting a shared list of tasks for the next job to perform. Currently, they lock the list, find the item at the head, remove it, and unlock the list. Please take into account all error conditions and possible data races, so that no two threads can end up working on the same task, and no task is accidentally skipped.

I suspect that the code to do this may suffer from over-complexity, and may perform poorly in the case of high contention.

1800 INFORMATION
This is quite easy ... take the queue, make a copy, dequeue one work item, and replace the original queue with the modified copy (if the original queue is still present). There is of course quite a bit of overhead and every thread might fail several times (up to infinity) because other threads concurrently dequeued a work item.
Daniel Brückner
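A minimal sketch of the copy-and-swap dequeue Daniel describes, assuming today's System.Collections.Immutable package (which postdates this discussion) and a hypothetical IWorkItem interface:

    using System.Collections.Immutable;
    using System.Threading;

    public interface IWorkItem { void Execute(); }

    static class WorkList
    {
        static ImmutableQueue<IWorkItem> queue = ImmutableQueue<IWorkItem>.Empty;

        // Exactly one thread can win any given item, because every racer
        // that lost has its compare-exchange fail and must retry.
        public static bool TryTake(out IWorkItem item)
        {
            while (true)
            {
                ImmutableQueue<IWorkItem> seen = Volatile.Read(ref queue);
                if (seen.IsEmpty) { item = null; return false; }

                ImmutableQueue<IWorkItem> rest = seen.Dequeue(out item);
                if (Interlocked.CompareExchange(ref queue, rest, seen) == seen)
                    return true; // we own this item; nobody else got it
                // Another thread dequeued first; retry against the new queue.
            }
        }
    }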
My point being that this code is somewhat more complex (thus more susceptible to error), doesn't really solve the issue of performance under load, and suffers from some serious fairness issues.
1800 INFORMATION
The compare-exchange solution has the advantage that it will not block all other threads if a thread halts (the way a lock does when its owner halts). Skipping a work item might happen in both versions, if a thread halts after it has dequeued a work item.
Daniel Brückner
Sorry just a question of terminology, what do you mean by "halt"? Do you mean as in the OS interrupting because of the end of timeslice, or something more terminal?
1800 INFORMATION
I wanted to avoid "if a thread crashes" because I wanted to include everything - a thread abort, reaching the end of a time slice and not obtaining a new one for a long time, waiting for I/O, ... The case of skipping a task occurs of course only when a thread is aborted or suspended forever. While this seems unavoidable, there are solutions where other threads complete the work of halted threads (in a limited fashion).
Daniel Brückner
Easy! Reverse it. Let the default thread state be "wait to be assigned work" instead of "contend for work". Have an assignment thread which dequeues an item, sets it aside, and then signals a worker to have at it. The only possible contention then is between dequeuing and enqueuing items, BUT since this happens at opposite ends of the queue, no contention will occur as long as the queue has items in it. Meanwhile, queuing incoming requests is a non-issue as well, because there will always be a winner and therefore a steady flow into the queue: prep prep prep, try queue, try queue, prep prep...
Triynko
You are just serializing the access to the queue by using a single thread to enqueue or dequeue the items. This has (almost) no advantage over using a lock and adds additional complexity by adding a new coordination thread. Further you are left with the problem of multiple threads trying to enqueue new work items - how would you handle this in an easy way?
Daniel Brückner
Access to a queue is serial by nature...first in, first out. Not first in, second out. So I'm not sure what you mean. Many threads will be processing tasks, but just one thread handles reading the queue and activating other threads. It's a more economical use of processor cycles to have one thread perform the dequeuing and enlist other threads, than it is to have ALL the threads wasting time checking for stuff to do, when they could be resting (freeing processor cycles for busy threads) or doing actual work.
Triynko
All threads waiting for a lock don't continuously check the lock state - they are waiting in a queue managed by the runtime or operating system and in consequence don't consume any processor cycles. So your activation thread is already there in the form of the lock manager and thread scheduler. And what about the second problem - multiple threads trying to enqueue work items? Would you suggest to use one queue per thread?
Daniel Brückner
Multiple threads trying to enqueue: read x, add item, try compare/exchange, repeat. One will always be first. Items will be added as fast as atomic exchanges can be made, in a steady stream. Queue x = new Queue() [Queuing thread: read x, update x, compare/exchange x] [Dequeuing thread: read x, create new Queue(), compare/exchange x with new Queue()]. So you see, the dequeuing thread periodically STEALS the queue and replaces it with an empty one. The dequeuing thread MERGES the queues it has stolen, which it keeps private, and uses them to activate sleeping threads in the meantime.
Triynko
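A sketch of that queue-stealing idea, again assuming System.Collections.Immutable and the hypothetical IWorkItem from the earlier sketch: producers CAS-push onto a shared immutable stack, and the dispatch thread steals everything with a single atomic swap:

    using System.Collections.Immutable;
    using System.Threading;

    static class MailSlot
    {
        // Producers push here; the single dispatch thread periodically
        // steals the whole stack and replaces it with an empty one.
        static ImmutableStack<IWorkItem> incoming = ImmutableStack<IWorkItem>.Empty;

        public static void Post(IWorkItem item)
        {
            while (true)
            {
                var seen = Volatile.Read(ref incoming);
                if (Interlocked.CompareExchange(ref incoming, seen.Push(item), seen) == seen)
                    return; // one producer always wins; the rest simply retry
            }
        }

        // Called only by the dispatch thread: STEAL everything in one swap.
        public static ImmutableStack<IWorkItem> StealAll() =>
            Interlocked.Exchange(ref incoming, ImmutableStack<IWorkItem>.Empty);
    }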
Stealing the queue is a good idea to lower contention that I have never thought about before. But I assume the enqueue process could easily livelock under high contention, and with a long queue requiring some time to create a copy. You should prototype and profile that.
Daniel Brückner
In general, one can always use PULL instead of PUSH, to keep things fair. Suppose hardware can buffer a few thousand requests and present 4 at a time in "slots". Four processors each work with one slot, synchronized with the queuing hardware via clock cycle, so they take turns [test empty/present]--[test empty/read]. So one real "thread" per queue. The coordinating hardware or thread then, in a fair fashion, reads and assigns available items from each of the four slots, one at a time, sending each to an available worker, then loops.
Triynko
Any parallel system must eventually reduce to a single management thread in order to subdue chaos. In the brain, it's consciousness. The system CAN and WILL function without it, but not as efficiently, and livelocks can and will occur otherwise. Of course, as I've said, there will always be one winner, so as fast as compare/exchange can occur, there will be a steady flow into the queue. Stealing the queue makes it smaller, so even though there are two different VIEWS of it... there is truly only one REAL queue.
Triynko
+2  A: 

Your compare-exchange suggestion has one big drawback - it is not fair, because it favors short tasks. If there are many short tasks in a system, the chances for a long task to ever complete may get very low.

Daniel Brückner
Nothing is left to chance in deterministic hardware (how random can a zero or one really be?). Everything can be handled. Some part of the system can be reduced to a lock-free queue, in which case, the "long" operations we're talking about probably need to be serialized anyway. If not, then again, it's the "what to do when the update fails" question. Depending on the type of work done up front, checking whether the work is still valid may in fact require just a few simple checks, so while the original attempt was a long operation, the RETRY may in fact be a very short operation.
Triynko
Computers are not deterministic if they operate in an uncontrolled environment. A large range of influences, from radiation corrupting a bit in memory and triggering error-correction hardware or software, to variations in oscillator frequency due to temperature changes, to (user) input arriving from mouse, keyboard, microphone, camera, or network connection at unpredictable times, makes it practically impossible to have a completely deterministic system.
Daniel Brückner
Take the following example. There is a list, and some threads insert items into the list while other threads remove items matching a condition. The insert could be very fast while removal could be linear. In consequence, your solution would favor adding new items, which in turn makes removal take even longer because the list grows, which again favors inserts even more.
Daniel Brückner
If I smash the computer, I don't have a computer any more than if radiation corrupts a bit. However, as long as we DO have a computer, it is deterministic, including its inputs. For example... for every BIT that enters the system, I KNOW it's a zero or one. With a branch, I handle BOTH possible inputs. When you scale this up to multiple bits, the principle doesn't go away. All bugs... stack overflows, buffer overflows... ARE avoidable; some programmers are just ignorant. It IS that simple. If 0-0xFFFFFFFF are not ALL valid, then don't use "int"; create an immutable constrained data class and pass it.
Triynko
Computers are not zeros and ones - they are analog. No bit is ever zero or one. It is below 1.2 volts or above. And these values are influenced by environmental factors you have no control over. You could - in theory - try to build a system that handles all states correctly, but this is practically impossible. You would have to ensure the correct behavior of billions of transistors and millions of lines of code in operating systems grown over decades, ...
Daniel Brückner
Of course. At some point, at some layer we've built on that analog system, we have to assume a digital system for writing software. If we can't depend on mathematical operations being well-defined, then we're moving into the realm of unrecoverable errors. Most errors are not fatal in such a way that they invalidate the design of the software. The question is... what's within the range of expected values, and is the software designed to account for all possibilities? If it's not, then the design is incomplete, and unhandled conditions can cause crashes or undefined behavior.
Triynko
Oh, sorry, I guess I DID use the phrase "deterministic hardware". I actually meant "system", so I was referring to the systems we write software on, not directly to the hardware layer, which is, of course, analog.
Triynko
In any truly asynchronous environment, there's bound to be some level of non-determinism. That's a big part of what makes multithreading hard.
kyoryu
Yes, there is some level of non-determinism, because the threads are running independently of one another; however, the whole point of having locks and other structures is to RESOLVE to a shared, consistent state in a deterministic manner... thereby eliminating any level of non-determinism.
Triynko
+1  A: 

There are four conditions for a race to occur.

  1. The first condition is that there are memory locations that are accessible from more than one thread. Typically, these locations are global/static variables, or are heap memory reachable from global/static variables.
  2. The second condition is that there is a property (often called an invariant), which is associated with these shared memory locations that must be true, or valid, for the program to function correctly. Typically, the property needs to hold true before an update occurs for the update to be correct.
  3. The third condition is that the invariant property does not hold during some part of the actual update. (It is transiently invalid or false during some portion of the processing).
  4. The fourth and final condition that must occur for a race to happen is that another thread accesses the memory while the invariant is broken, thereby causing inconsistent or incorrect behavior.

    1. If you don't have a shared memory location which is accessible from multiple threads, or you can write your code to either eliminate that shared memory variable or restrict access to it to only one thread, then there is no possibility of a race condition, and you don't need to worry about anything. Otherwise, the lock statement, or some other synchronization routine, is absolutely necessary and cannot be safely ignored.

    2. If there is no invariant (let's say all you do is write to this shared memory location and nothing in the thread operation reads its value), then again, there is no problem.

    3. If the invariant is never invalid, again no problem. (Say the shared memory is a datetime field storing the last time the code ran; then it can't be invalid unless a thread fails to write it at all.)

    4. To eliminate condition 4, you have to restrict write access to the block of code that accesses the shared memory from more than one thread at a time, using lock or some comparable synchronization methodology.

The "concurrency hit" is in this case not only unavoidable but absolutely necessary. Intelligent analysis of what exactly is the shared memory, and what exactly is your critical "invariant" allows you to code the system to minimize this concurrency "Hit". (i.e., maximize concurrency safely. )

Charles Bretana
I agree. I actually cache temporal object hierarchies, object schemas, and object permission sets in my code to perform hierarchical permission checks on parented objects, then resolve the permissions list, checking for deny, then favoring the most specific permission on the object itself, then up the hierarchy. The concurrency "hit" is unavoidable; but I think "lock" takes the hit worse than compare/exchange, because it's easy to design it wrong and run into deadlock, which is essentially an infinite loop waiting for a condition that will never occur.
Triynko
I cached that stuff, because trying to hit the database every time is multiple orders of magnitude slower. The caching and use of the cache takes milliseconds, whereas the same process without the cache takes 30 or more seconds. It's a HUGE gain. When an object is reparented and a new entry goes into the temporal hierarchy, these caches are reconstructed anew in the background, then compared/exchanged instantly. Threads can continue to work with old copies, but it's FINE, because when they write to the database, it will FAIL if it actually breaks the relational integrity. It's nice.
Triynko
@Triynko, deadlocks can only occur if two different threads in different blocks of code attempt to 'lock' two different objects in the opposite order. (i.e., one block of code locks A, then B, and the other locks B, then A...) Leaving aside the issue of designing synchronization code to use two different lock references (which is the whole complex topic of synchronization object granularity), this is not hard to avoid. Understanding the nature of your shared memory and invariants, and their relationship, is sufficient to design systems which are deadlock-safe.
Charles Bretana
@triynko, RI in the DB is not the only shared memory that can cause concurrency issues. In fact, DB concurrency is a completely different (and much easier to solve) problem than the concurrency issues caused by multi-threaded code on a client. Caching data is not a solution if multiple threads have concurrent unsynchronized access to the cache.
Charles Bretana
Yeah, avoiding deadlock is easy, that's one thing I always stress when people complain about it. It's just that in principle, locking and blocking is a waste of time; there's a much more efficient way to handle concurrency.
Triynko
One of my caches consists of three dictionaries. Since I'm concerned about the integrity of the set, not how current the set is (e.g. I don't care if permission changes don't affect an operation that started before they were changed), I put all three dictionaries in one container object, and when updating, I grab the original object, do the update, then I reconstruct the cache as a new container object with 3 new dictionaries, and then update the container reference with Interlocked.CompareExchange. An internal version number determines whether the update was successful and what happens next.
Triynko
That will be OK as long as the update to the dictionaries' data is not in any way dependent on the pre-update value of the dictionary... If it is, then the read operation to determine the 'current' value, which will be used in determining the 'new' value, must be synchronized with all other threads that might try to read it. This is the more common situation: some shared memory must be read, and used to determine a new value to be written back into that shared memory. Between the read and the write some defined invariant is broken, and no other threads can access it safely.
Charles Bretana
That view only holds when three specific assumptions are true: 1. The data structure is mutable, 2. an update always succeeds, and 3. the structure can momentarily represent an inconsistent state. I am destroying those assumptions by declaring that an immutable data structure is used representing a consistent state, and an atomic compare-exchange (an inherently conditional operation that may or may not succeed) is used to publish the update. If the compare-exchange succeeds, then the pre-update data structure upon which the update is based has not changed and the update is valid.
Triynko
I'll talk about an unsuccessful update next, but first consider the consequences of a successful update. Instantaneously after a thread successfully makes an update... all other threads working with the same data are suddenly working with outdated, although logically consistent, data. The only time the outdatedness matters to the overall system state is when it attempts to publish the work done. Now... assume the update FAILS. All is not lost. First... had locks been used, the other threads would have been blocked doing NO WORK, potentially wasting processor cycles.
Triynko
Second, because of the logically consistent, although outdated, state of the structure operated upon, SOME OR ALL of the work done may still be valid, which puts this approach a step ahead of the locking/blocking approach which would have left multiple cores waiting around doing nothing at all. The result of this design is ultimately rapid bursts of successful updates to the shared state... thanks to having lots of work done in parallel on immutable structures, and rapid re-attempts at publishing valid work done to the shared global state.
Triynko
If you REALLY think about it... this approach parallels how millions of independent thinkers operate in reality, advancing the state of the humanity. We don't all wait around for someone else to solve a problem... we all work on it. Sometimes we find that someone has beaten us to the solution, but the work we did does not necessarily overlap, and therefore does not necessarily conflict and is not necessarily redundant. As a result, a SECOND rapid publication to the global state can occur, whereas alternatively we'd have been doing nothing, waiting for the first advance to be published.
Triynko
Finally, the immutable nature of the data structure means it's always consistent and always available for consumption, as long as timeliness isn't critical... which is the case in a lot of situations. Like any kind of publication, even though updates are constantly being made... at some point, you need to grab a snapshot of the information, burn it or print it, and ship it out for consumption.
Triynko
1. How can the data in the dictionary be immutable if the purpose of this function is to update it?? Your other comments betray a fundamental failure to grok the issue here. Which, again, is that the accuracy of the value you are writing back into the dictionary is compromised by the fact that the algorithm used to calculate it did not have access to all the information it needed, to wit, the updates made to the value after your algorithm "read" the preceding value.
Charles Bretana
The simplest example is the bank account balance... If you are incrementing it, and you read a current value of $50, and add $10, and write back the sum ($60), but while you were calculating some other thread has written $20 in there, then you overwrite that value of $20... The correct value should have been $30, and you are writing $60 into the data... which is wrong.
Charles Bretana
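For comparison, a sketch (illustrative code, not from either commenter) of how a compare-exchange loop handles exactly that bank-balance case: the stale write of $60 fails, the $10 is re-applied to the fresh $20, and $30 is what actually lands:

    using System.Threading;

    static class Account
    {
        static int balance = 50;

        public static void Deposit(int amount)
        {
            while (true)
            {
                int seen = Volatile.Read(ref balance); // read $50 (or $20)
                int proposed = seen + amount;          // compute $60 (or $30)
                if (Interlocked.CompareExchange(ref balance, proposed, seen) == seen)
                    return; // nobody changed the balance in between; done
                // Another thread wrote $20 meanwhile: our CAS fails, we
                // re-read and correctly produce $30 on the retry.
            }
        }
    }

(For a plain integer, Interlocked.Add(ref balance, amount) does this in one call; the loop form generalizes to arbitrary computed updates.)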
"rapid work done in immutable structures" ?? immutable means unchangeable, impossible to update, in other words, permanently locked. You cannot "do work" on an immutable data structure, all you can do is read it... If you are referring to the data structure your process is actually modifying or updating, then it is by definition, not immutable. The fact that you discuss how "current" this data structure is betrays the fact that it cannot be immutable.. Anything that can become non-current cannot be immutable. If it was immutable, it would, by definition always be current, forever.
Charles Bretana
No, you're completely missing what I said. The data set (a state) is composed of three immutable dictionaries, so the state is an immutable snapshot that doesn't require locking for reading. The state has an ID. When you want to update the state, you read it, work with the data, and construct a new state. Because the state (snapshot) was immutable, your work done is guaranteed to be consistent, but NOT NECESSARILY correct.
Triynko
Its correctness depends on whether the state (snapshot) has been updated since your work began, which can be checked by an atomic compare/exchange on the state (snapshot). If it succeeds, the data you worked with to produce a new state is still current, so your new state is valid. Now here's what I meant about the potential for failure. Because other threads could be doing work, the state may have been updated, causing all other operations to be invalidated. HOWEVER, my point was that the work done in those other operations may be partially valid, so when they re-calculate...
Triynko
...and try to re-publish, the operation could be much shorter, so that not all the work was done in vain. By checking whether the work is still valid (by COMPARING the original source state with the newly updated source state), it can be QUICKLY determined that the work done (which was theoretically invalid)... is actually still valid... or is partially valid and needs a slight alteration. In your very simple contrived example, of course adding 50 + 10 and overwriting a value of 20 would be invalid, and that simple calculation would have to be redone, because IT WOULD FAIL...
Triynko
...using this system. It wouldn't overwrite the 20... it would see that 20 is there, whereas a 50 was there when the calculation started. Because the state is an immutable object, an ENTIRE STATE is represented by a single object ID, so it's easy to compare/exchange the state such that only valid new states are published. The fact that calculated new states are ALLOWED to fail (rather than blocking, doing nothing), and the fact that invalid new states can be QUICKLY validated and altered without redoing ALL the work, is the advantage of this. Refer back to my research metaphor earlier.
Triynko
Again, immutable means unchangeable. If the state is immutable, it is a contradiction to even talk about updating it... You are either not accurately communicating what you are thinking (perhaps you are talking about two different things, one of which is immutable and the other is not?), or you don't understand what immutable means.
Charles Bretana
Perhaps all along here you have been talking about an object (a dictionary) whose elements are themselves mutable, but whose presence in the dictionary cannot be changed. The dictionary could accurately be described as read-only, since the elements in it cannot be removed and new elements cannot be added, but to say that the dictionary is immutable when those elements themselves can be modified or updated is at best misleading, and at worst, not even wrong.
Charles Bretana
A: 

The big advantage of a lock over a CAS operation like Interlocked.CompareExchange is that you can modify multiple memory locations within a lock and all the modifications will be visible to other threads / processes at the same time.

With CAS, only a single variable is atomically updated. Lockfree code is usually significantly more complex because not only can you only present the update of a single variable (or two adjacent variables with CAS2) to other threads at a time, you also have to be able to handle "fail" conditions when CAS doesn't succeed. Plus you need to handle ABA issues and other possible complications.

There are a variety of methods, like low-lock techniques, fine-grained locking, striped locks, reader-writer locks, etc., that can make simple locking code much more multicore-friendly.

That said, there are plenty of interesting uses for both locking and lockfree code. However, unless you REALLY know what you're doing, creating your own lockfree code is not for a beginner. Use lockfree code or algorithms that have been well proven, and test them thoroughly, because the edge conditions that cause failure in many lockfree attempts are very hard to find.

Adisak
"Variable" and "memory location" are not synonymous. A single "memory location" can be designated as a pointer, such that it actually represents an *immutable* data structure with an arbitrary number of "variables". That way, replacing the pointer (in a single memory location) to the data structure through a CAS operation effectively makes a whole new data structure (with an abritrary number of "variables") available to all threads instantly. Threads are even free to work with earlier "snapshots" of data structures while new versions of the structures are being built if synchronized properly.
Triynko
Snapshots of data structures do not work for modifications that need to be done serially. Also, there are many data structures referenced through a single "pointer" that don't lend themselves well to this swap methodology. For example, a red-black tree has a single pointer as a root node, but it makes much more sense to lock it and work on the tree than to generate an entirely new tree and update it with a CAS pointer swap. The CAS-swapped snapshot works best on data that is either read-only or predominantly read. You also need garbage collection or reference counting for cleanup.
Adisak
A: 

I would say it is no more obsolete than saying, in general, that pessimistic concurrency is obsolete given optimistic concurrency, or that pattern A is obsolete because of pattern B. I think it's about context. Lock-free is powerful, but there may not be a point in applying it unilaterally, because not every problem is perfectly suited to it. There are trade-offs. That said, it would be good to have a general-purpose lockless, optimistic approach where it hasn't been realized traditionally. In short, yes, lock can do something that can't be achieved with the other approach: represent a potentially simpler solution. Then again, it may be that both have the same result, if certain things don't matter. I suppose I'm equivocating just a bit...

Kit