views:

312

answers:

8

How do I prevent a race condition WITHOUT locking or using mutexes/semaphores in C++? I'm dealing with a nested for loop in which I will be setting a value in an array:

for (int i = 0; i < m; ++i)
  for (int j = 0; j < n; ++j)
    for (int k = 0; k < o; ++k)
      array[k] += foo(...);

More or less, I want to deal with this so that I can ensure different threads running at the same time don't write to array[k] at the same time. Any suggestions on how to approach this?

Edit: I am running on a Linux machine and I also have to use the Intel compiler. I will be using "icc" instead of "gcc" to compile the code.

+2  A: 

Assuming the array holds integers, use gcc's atomic builtins. `__sync_fetch_and_add` should do the trick.

Marcelo Cantos
For this specific program, I will be using Intel's compiler. I'm not sure if that means I can't use "gcc's atomic builtins".
Hristo
+5  A: 

For this particular loop, turn it inside out. Put k on the outside, then you can farm k=0 out to a different thread than k=1 and so on.

As long as foo() doesn't depend on array[k] where k != its current k, then you're golden.

sblom
I like your idea. I don't quite understand it though. What do you mean by "As long as foo() doesn't depend on array[k] where k != it's current k"?
Hristo
Watch out for [false sharing](http://en.wikipedia.org/wiki/False_sharing). The cache can work against you.
Tim Kryger
Oh, I just mean that if `foo()` depends on `array[0]` when it's trying to compute `array[1]`, then you're going to have even tougher synchronization problems to solve than just using Marcelo's suggestion or something like it.
sblom
no... foo() DOES NOT depend on array[index]. So how does turning it inside out help my situation, conceptually?
Hristo
What it does is that it gives you `o` different units of work that are each independent from each other. You can farm out identical loops through `i=0..m, j=0..n` to different threads, one per different value for `k`.
sblom
Thanks. I'll give it a try.
Hristo
+3  A: 

Assuming Windows, and that the array contains elements of type LONG, you could do something like:

for (int i = 0; i < m; ++i) 
   for (int j = 0; j < n; ++j) 
      for (int k = 0; k < o; ++k)  {
          LONG val = foo(...);
          InterlockedAdd( &array[k], val);
      }

If you're not working on Windows, your platform may have a similar set of APIs. As long as your platform has an InterlockedCompareExchange()-type API, you can write your own version of InterlockedAdd().

Something like the following (disclaimer - untested):

 int InterlockedAdd( int volatile* pDest, int operand)
 {
      int curval = *pDest;
      int oldval;

      do {
           oldval = curval;
           curval = InterlockedCompareExchange( pDest, oldval + operand, oldval);
      } while (curval != oldval);

      return oldval+operand;
 }

As far as I know, Boost only has limited support for atomic/interlocked operations, apparently only enough to support atomic manipulation of reference counts. I don't think that the support for interlocked operations in Boost is more than an implementation detail (I'm currently dealing with a somewhat older version of Boost, so it's possible that this isn't the case anymore).

There are some portable libraries that support atomic compare-and-exchange and other atomic operations as documented parts of their interfaces.

Also note that C++0x will have support for atomic operations like compare/exchange - I'm not sure what the level of support is in current C++ compilers (it doesn't appear to be in VS 2010).

Michael Burr
Do you know if there is a boost function for InterlockedAdd or InterlockedCompareExchange?
Inverse
Missing a closing curly brace on the do while loop.
titaniumdecoy
Fixed the brace, added a little about various library support for atomic operations.
Michael Burr
A: 

Partition the innermost loop among the threads. Thread T1 processes indices in the range [0, L), thread T2 processes indices in the range [L, 2L), etc., where L = o/T and T is the number of threads. This assumes that the foo() call does not use other locations that might be concurrently computed.

EDIT: using interlocked operations, as others have suggested, will give a correct result but might degrade performance badly. (If the inner loop is short, many threads will be competing for a few memory locations, which will effectively serialize your program.)

zvrba
Since I'm trying to boost the performance for this program, using interlocked operations is not a good idea.
Hristo
A: 

The easiest way (though not the most efficient!) is to wrap the loop body in a "critical" section.

See the Wikipedia article on [critical sections](http://en.wikipedia.org/wiki/Critical_section); this will restrict the loop code to a single thread at any given time.

James Anderson
+2  A: 

The way you want it, it cannot be done! (See sbi's comment.)

You could use shared memory, but there will still be locks.
You could also dedicate a single thread to reading and writing the array. If you think it's easier to set up the correct protocol for that, go ahead.

Anyway, there are already good solutions given using locks (either directly or indirectly). Just pick one :)

Burkhard
A: 

It looks like in this specific scenario, in which nested loops provide the index used to access your array, you can have different threads work on different ranges of indices, thus avoiding locking entirely.

jweyrich
A: 

> using interlocked operations, as others have suggested, will give correct result but it might degrade performance badly. (If the inner loop is short, many threads will be competing for few memory locations, which will effectively serialize your program.)

Not true, my friend. Actually, interlocked operations are superior to all kinds of locking. More precisely: all synchronization objects (critical sections, mutexes, events, etc.) are ultimately implemented in terms of interlocked operations. In fact, interlocked operations are the only processor instructions that can guarantee synchronization consistency. It's simply impossible to implement a synchronization object without using interlocked operations at all.

The other question is the scope of locking. You probably meant that synchronizing (whether with sync objects or directly with interlocked operations) inside an inner loop may lead to performance degradation, since it's executed many times.

Well, this is true. But on the other hand, you lock only what you need, only for the needed duration. Hence you prevent potential synchronization collisions.

valdo