Why is CompareAndSwap instruction considered expensive?
I read in a book:
"Memory barriers are expensive, about as expensive as an atomic compareAndSet() instruction."
Thanks!
That's because they introduce extra overhead to make the operation atomic. The underlying platform has to forgo optimizations it would otherwise make (such as reordering memory operations and working on cached values) and wait for outstanding loads and stores to complete before proceeding past the barrier. While that is happening the CPU can't make forward progress, so the overall program incurs a delay.
In general, atomic operations are expensive because they require cross-CPU synchronization. A "normal" operation is allowed to operate on cached data, allowing extra speed. Take, for example, this code on a 2-CPU system:
Thread 1
while (1) x++;
Thread 2
while (1) x++;
Because increment is not an atomic operation or protected by a memory barrier, the results of this are pretty much undefined. You don't know how x will be incremented, and increments can even be lost (both threads read the same value, both add one, and one update overwrites the other).
Thread 1
while (1) atomicIncrement(&x);
Thread 2
while (1) atomicIncrement(&x);
Now, you are trying to get well defined behavior - no matter the ordering, x must increment one by one. If the two threads are running on different CPUs, they have to either reduce the amount of allowed caching or otherwise "compare notes" to make sure that something sensible happens.
This extra overhead can be quite expensive, and it's the general cause of the claim that atomic operations are slow.
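The atomicIncrement() above can be sketched in portable C11 with stdatomic (this is my own minimal demo, not code from the answer; the thread count and iteration count are arbitrary):

```c
#include <pthread.h>
#include <stdatomic.h>

enum { ITERS = 100000 };
static atomic_int x;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_add(&x, 1);   /* the atomicIncrement(&x) from above */
    return NULL;
}

/* Runs two incrementing threads and returns the final counter value.
   With atomic_fetch_add no increment is ever lost. */
int run_counter(void) {
    atomic_store(&x, 0);
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&x);
}
```

With a plain `int x` and `x++` instead, the final value would usually come out below 200000 because of lost updates.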
I think I found the answer in my book:
Each getAndSet() call is broadcast on the bus. Because all threads must use the bus to communicate with memory, these getAndSet() calls delay all threads (cores), even those not waiting for the lock.
Even worse, the getAndSet() call forces other processors to discard their own cached copies of the lock, so every spinning thread encounters a cache miss almost every time, and must use the bus to fetch the new, but unchanged value.
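The usual fix the book goes on to describe is to spin on a plain read first, so the waiting threads hit their local cache instead of hammering the bus. A minimal test-and-test-and-set (TTAS) spin-lock sketch in C11 (the names are mine, not from the book):

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } ttas_lock;

static void ttas_acquire(ttas_lock *l) {
    for (;;) {
        /* cheap: while the lock is held, this read is served from the
           local cache line and generates no bus traffic */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* expensive: one atomic getAndSet-style exchange on the bus,
           attempted only when the lock looked free */
        if (!atomic_exchange(&l->locked, true))
            return;
    }
}

static void ttas_release(ttas_lock *l) {
    atomic_store(&l->locked, false);
}
```

The point is that the costly atomic exchange is issued rarely, instead of on every spin iteration.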
"expensive" is very relative here. It's absolutely insignificant compared with, say, a harddisk access. But RAM bus speed has not kept up with the speed of modern CPUs, and compared with arithmetic operations inside the CPU, accessing the RAM directly (i.e. non-cached) is quite expensive. It can easily take 50 times as long to fetch an int
from RAM than to add two registers.
So, since memory barriers basically force direct RAM access (possibly for multiple CPUs), they are relatively expensive.
"CAS isn't appreciably different than a normal store. Some of the misinformation regarding CAS probably arises from the original implementation of lock:cmpxchg (CAS) on Intel processors. The lock: prefix caused the LOCK# signal to be asserted, acquiring exclusive access to the bus. This didn't scale of course. Subsequent implementations of lock:cmpxchg leverage cache coherency protocol -- typically snoop-based MESI -- and don't assert LOCK#." - David Dice, Biased locking in HotSpot
"Memory barriers are expensive, about as expensive as an atomic compareAndSet() instruction."
This is quite true.
E.g. on x86, a proper CAS on a multi-processor system has a lock prefix.
The lock prefix results in a full memory barrier:
"...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.
A stand-alone memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the Java JIT on x86/x64.
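In portable C11, the equivalent of such a full fence is atomic_thread_fence(memory_order_seq_cst), which compilers on x86 typically lower to exactly this kind of locked instruction (or MFENCE). A sketch of the classic Dekker-style store-then-load pattern that needs it (my own illustrative example; in a real run a second thread would execute the mirror image with the flags swapped):

```c
#include <stdatomic.h>

static atomic_int flag0, flag1;

/* Thread 0's entry attempt: announce intent, then check the peer. */
int try_enter_0(void) {
    atomic_store_explicit(&flag0, 1, memory_order_relaxed);
    /* Full barrier: the store to flag0 must become globally visible
       before we load flag1. Without it, the store could sit in the
       store buffer while both threads read 0 and both enter. */
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag1, memory_order_relaxed) == 0;
}
```

It is precisely this store-buffer drain that makes the fence cost roughly the same as a locked CAS on x86.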
On x86, CAS results in a full memory barrier.
On PPC, it is different. An LL/SC pair - lwarx & stwcx - can be used to load the memory operand into a register, then either write it back if there has been no other store to the target location, or retry the whole loop if there has. An LL/SC sequence can be interrupted (losing the reservation), and it does not imply a full fence by itself.
Performance characteristics and behaviour can be very different on different architectures.
But then again - a weak LL/SC is not CAS.
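C11 exposes this difference directly: atomic_compare_exchange_weak is allowed to fail spuriously (e.g. a lost LL/SC reservation), which is why it is always wrapped in a retry loop. On LL/SC machines the loop body maps closely onto a single lwarx/stwcx pair; on x86 it becomes a lock cmpxchg. A minimal sketch (my own example, not from the answer):

```c
#include <stdatomic.h>

/* Atomically increments *x via a CAS retry loop and returns the
   new value. */
int atomic_increment_cas(atomic_int *x) {
    int old = atomic_load(x);
    /* On failure (real contention or a spurious LL/SC failure),
       compare_exchange_weak reloads `old` with the current value
       and we retry with a freshly computed old + 1. */
    while (!atomic_compare_exchange_weak(x, &old, old + 1))
        ;
    return old + 1;
}
```

Using the weak variant here is deliberate: since we loop anyway, a spurious failure just costs one extra iteration, and on LL/SC architectures the weak form avoids the nested loop that a strong CAS would need.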