views:

487

answers:

10

Today I came across this question:

You have this code:

static int counter = 0;
void worker() {
    for (int i = 1; i <= 10; i++)
        counter++;
}

If worker were called from two different threads, what value would counter have after both of them finished?

I know that in principle it could be anything. But my gut tells me that counter++ will most likely be translated into a single assembler instruction, and that if both threads execute on the same core, counter will be 20.

But what if those threads run on different cores or processors? Could there be a race condition in their microcode? Can one assembler instruction always be viewed as an atomic operation?

A: 

I think that you'll get a race condition on access.

If you wanted to ensure an atomic operation in incrementing counter then you'd need to use ++counter.

ChrisBD
It doesn't really matter with modern compilers; it's the same operation.
vava
+3  A: 

No, you cannot assume this, unless it is clearly stated in the compiler specification. Moreover, no one can guarantee that a single assembler instruction is indeed atomic: in practice, each assembler instruction is translated into a number of microcode operations (uops).
The issue of race conditions is also tightly coupled with the memory model (coherence, sequential consistency, release consistency, etc.); for each one the answer and result could be different.

Artem Barger
+5  A: 

Not always - on some architectures one assembly instruction is translated into one machine code instruction, while on others it is not.

In addition - you can never assume that the programming language you are using compiles a seemingly simple line of code into one assembly instruction. Moreover, on some architectures, you cannot assume that one machine code instruction will execute atomically.

Instead, use proper synchronization techniques appropriate to the language you are coding in.
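For the question's C snippet, one portable such technique (a sketch assuming POSIX threads) is to guard the increment with a mutex:

```c
#include <pthread.h>

static int counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void worker(void) {
    for (int i = 1; i <= 10; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;  /* protected: only one thread at a time in here */
        pthread_mutex_unlock(&counter_lock);
    }
}
```

This is heavier than an atomic increment, but it works regardless of what the compiler or CPU does with `counter++` itself.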

Yuval A
explain the -1 please
Yuval A
Why the downvote?
musicfreak
One assembly instruction can be translated to many microcode instructions
Artur Soler
I know I can't assume anything, I wrote that in the question :)
vava
@Yuval, I downvoted because of this: "**Yes** - on most architectures one assembly instruction is translated into one machine code instruction." That is *probably* true for pure **RISC** architectures, but most definitely not true for **CISC** or even hybrid architectures like x86.
Nathan Fellman
Fair enough - since most common processors today are RISC, I edited my answer to reflect that
Yuval A
Ok. In that case I'll remove the downvote :)
Nathan Fellman
I mean "most common processors are CISC"...
Yuval A
+4  A: 
  1. Increment/decrement operations on 32-bit or smaller integer variables on a single 32-bit processor with no Hyper-Threading Technology are atomic.
  2. On a processor with Hyper-Threading Technology or on a multi-processor system, the increment/decrement operations are NOT guaranteed to be executed atomically.
Darin Dimitrov
This bit I know; I was wondering what could happen. Would the microcode get into race conditions?
vava
Yes, in case 2.) you have a race condition and the result is not guaranteed to always be 20.
Darin Dimitrov
This is wrong. Even on a single-threaded, single-CPU system you can still have external agents updating memory via DMA or PCI. Even a simple _inc memory_ instruction shouldn't be assumed to be atomic; you still have to use _lock inc memory_.
Nathan Fellman
Agreed with you @Nathan, in both cases a lock is necessary.
Darin Dimitrov
Nathan Fellman is correct: increment on shared memory is not guaranteed to be atomic unless you use a memory lock, even on single-core CPUs. Your thread can be preempted between the read and the increment, or some other mechanism like DMA can race with your update.
Pop Catalin
By the way, @darin, there's all this talk about explaining downvotes. I downvoted you because the answer is incorrect.
Nathan Fellman
+2  A: 

Another issue is that if you don't declare the variable as volatile, the generated code would probably not update memory at every loop iteration; only at the end of the loop would memory be updated.
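To illustrate this point: without `volatile` (or an atomic type), an optimizing compiler is free to collapse the loop so that memory is touched only once. Conceptually (the second function is a sketch of what an optimizer may effectively emit, not guaranteed output of any particular compiler):

```c
static int counter = 0;

/* What the source says: ten read-modify-write cycles on counter */
void worker_as_written(void) {
    for (int i = 1; i <= 10; i++)
        counter++;
}

/* What an optimizer may effectively emit: one load, one add, one store */
void worker_as_optimized(void) {
    counter += 10;
}
```

Single-threaded, the two are equivalent; under concurrency, the optimized form changes which interleavings are even possible, which is why reasoning about "what counter++ compiles to" is so fragile.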

Artur Soler
very good point, haven't thought about this
vava
+4  A: 
jop
`inc m32` is valid in x86
Nathan Fellman
Thanks for the correction Nathan.
jop
+6  A: 

The answer is: it depends!

There is some confusion around what an assembler instruction is. Normally, one assembler instruction is translated into exactly one machine instruction. The exception is when you use macros -- but you should be aware of that.

That said, the question boils down to: is one machine instruction atomic?

In the good old days, it was. But today, with complex CPUs, long-running instructions, hyperthreading, ... it is not. Some CPUs guarantee that certain increment/decrement instructions are atomic. The reason is that they are handy for very simple synchronization.

Also, some CPU commands are less problematic. A simple fetch (of one piece of data that the processor can fetch in one piece) is of course atomic, because there is nothing to be divided at all. But when you have unaligned data, it becomes complicated again.

The answer is: it depends. Carefully read the vendor's machine instruction manual. When in doubt, assume it is not atomic!

Edit: Oh, I see now that you also ask about ++counter. The statement "most likely to be translated" cannot be trusted at all. This also depends largely on the compiler, of course! It gets more difficult when the compiler applies different optimizations.

Juergen
I upvoted for "Some CPUs guarantee that some increment/decrement instructions are atomic" with my emphasis on *some*. Also, note that even a simple fetch isn't necessarily atomic. For instance, if it crosses a cacheline boundary it may be done in two parts. This can happen even for aligned data if the data type is larger than the cacheline.
Nathan Fellman
Hi Nathan, thanks for the comment! I did not think about cachelines, but you are right, this can be an additional complication. I would say that one cacheline is (mostly) a multiple of the normal word size (32/64 bit), isn't it? That is why I added "that the processor can fetch in one piece".
Juergen
+1  A: 

Might not be an actual answer to your question, but (assuming this is C# or another .NET language) if you want counter++ to really be atomic across threads, you could use System.Threading.Interlocked.Increment(ref counter).

See the other answers for details on the many different ways why/how counter++ could fail to be atomic. ;-)
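The closest C equivalent of `Interlocked.Increment` (a sketch, assuming a C11-capable compiler) is `atomic_fetch_add` from `<stdatomic.h>`:

```c
#include <stdatomic.h>

static atomic_int counter = 0;

void worker(void) {
    for (int i = 1; i <= 10; i++)
        atomic_fetch_add(&counter, 1);  /* one indivisible read-modify-write */
}
```

On x86 this typically compiles down to the `lock`-prefixed increment discussed elsewhere in the thread.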

peSHIr
+15  A: 

Specifically for x86, and regarding your example: counter++, there are a number of ways it could be compiled. The most trivial example is:

inc counter

This translates into the following micro operations:

  • load counter to a hidden register on the CPU
  • increment the register
  • store the updated register in counter

This is essentially the same as:

mov eax, counter
inc eax
mov counter, eax

Note that if some other agent updates counter between the load and the store, it won't be reflected in counter after the store. This agent could be another thread in the same core, another core in the same CPU, another CPU in the same system, or even some external agent that uses DMA (Direct Memory Access).

If you want to guarantee that this inc is atomic, use the lock prefix:

lock inc counter

lock guarantees that nobody can update counter between the load and the store.


Regarding more complicated instructions, you usually can't assume that they'll execute atomically, unless they support the lock prefix.
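The two variants can be written directly in GCC-style inline assembly (a sketch for x86/x86-64 with GCC or Clang; it will not compile on other architectures):

```c
static int counter = 0;

/* Plain increment: a read-modify-write that another agent can interleave */
static void inc_plain(void) {
    __asm__ volatile("incl %0" : "+m"(counter));
}

/* Locked increment: the lock prefix makes the whole RMW cycle atomic */
static void inc_locked(void) {
    __asm__ volatile("lock incl %0" : "+m"(counter));
}
```

Both increment `counter` by one; they differ only in whether other agents can update `counter` between the implicit load and store.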

Nathan Fellman
+1: some nice dive into x86 assembly specifics!
Juergen
A: 

On many other processors, the separation between the memory system and the processor is bigger. (Often these processors can be little- or big-endian depending on the memory system, like ARM and PowerPC.) This also has consequences for atomic behaviour if the memory system can reorder reads and writes.

For this purpose, there are memory barriers (http://en.wikipedia.org/wiki/Memory_barrier).

So in short: while atomic instructions are enough on Intel (with the relevant lock prefixes), more must be done on non-Intel architectures, since the memory I/O might not happen in the same order.

This is a known problem when porting "lock-free" solutions from Intel to other architectures.

(Note that multiprocessor (not multicore) x86 systems also seem to need memory barriers, at least in 64-bit mode.)
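In C11 terms, such a barrier can be expressed portably with `atomic_thread_fence`; a sketch of the classic publish/consume pattern the answer alludes to (names are illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload = 0;
static atomic_bool ready = false;

/* Producer: write the data, fence, then publish the flag */
void publish(int value) {
    payload = value;
    atomic_thread_fence(memory_order_release);  /* payload visible before flag */
    atomic_store_explicit(&ready, true, memory_order_relaxed);
}

/* Consumer: observe the flag, fence, then read the data */
bool try_consume(int *out) {
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);  /* flag seen before payload read */
    *out = payload;
    return true;
}
```

On x86 these fences mostly compile to nothing (the hardware already orders ordinary loads and stores strongly); on ARM or PowerPC they emit real barrier instructions, which is exactly the porting hazard described above.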

Marco van de Voort
Regarding your edit, why do you limit yourself only to 64-bit mode? Why do you limit yourself to multi-processor systems? In fact, not only can you not assume any ordering on mutli-core systems, you also can't for single-core, multi-threaded systems.
Nathan Fellman
AFAIK, on multicore (but not multiprocessor) systems the shared cache and cache coherency cause a synchronous memory system. IIRC the 64-bit issue was that when a 64-bit memory load crossed a cacheline boundary, it took double the memory cycles to fetch. But read the Intel or AMD ABI/processor manuals (search for sfence) and see for yourself. I'm just trying to regurgitate what I remember without renewing the research.
Marco van de Voort