x86 and other architectures provide special atomic instructions (`lock`, `cmpxchg`, etc.) that allow you to write lock-free data structures. But as more and more cores are added, it seems as though the work these instructions actually have to do behind the scenes will grow (at least to maintain cache coherency?). If an atomic add takes ~100 cycles today on a dual-core system, might it take significantly longer on the 80+ core machines of the future? If you're writing code to last, might it actually be a better idea to use locks even if they're slower today?
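
For reference, the kind of operation in question, written with C11 atomics (on x86 a compiler typically lowers this to a single `lock add`/`lock xadd` instruction):

```c
#include <stdatomic.h>

atomic_int counter;

void hit(void) {
    /* One instruction on x86 (lock add / lock xadd), but the executing
     * core must first gain exclusive ownership of counter's cache line,
     * which is where the cross-core coherency traffic comes from. */
    atomic_fetch_add(&counter, 1);
}
```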

+4  A: 

For the question posed in the title, the short answer is "yes"; the long answer is "it is complicated."

With regard to locks being better: no. Internally, a lock has to push at least as much (if not more) traffic over the bus. Think about it this way: if the processor had only one atomic operation, an atomic compare-and-swap, you could use it to implement both locks and atomic increments. At the bus protocol level there are only a few primitives in use. Locks are not slower than atomic operations because they are doing something different; they are slower because they are doing more of the same thing (from a coherency standpoint). So as atomic operations slow down, locks will tend to slow down comparably.
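
To make the point concrete, here is a minimal sketch (assuming C11 atomics; the names are illustrative) of building both a lock and an atomic increment out of nothing but compare-and-swap:

```c
#include <stdatomic.h>

typedef struct { atomic_int state; } spinlock_t;   /* 0 = free, 1 = held */

static void spin_lock(spinlock_t *l) {
    int expected = 0;
    /* Retry the CAS until this core is the one that flips 0 -> 1. */
    while (!atomic_compare_exchange_weak(&l->state, &expected, 1))
        expected = 0;   /* a failed CAS overwrote 'expected'; reset it */
}

static void spin_unlock(spinlock_t *l) {
    atomic_store(&l->state, 0);
}

/* An atomic increment built from the very same primitive. */
static int cas_fetch_add(atomic_int *p, int delta) {
    int old = atomic_load(p);
    while (!atomic_compare_exchange_weak(p, &old, old + delta))
        ;   /* a failed CAS refreshed 'old'; recompute and retry */
    return old;   /* value before the increment, like fetch_add */
}
```

Both loops generate the same kind of coherency traffic per attempt; the lock simply adds a second round trip for the release (plus spinning or sleeping under contention), which is the "more of the same thing" above.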

Having said that, there are lots and lots of papers on the subject, and particular cases are complicated. I wouldn't worry about how your code is going to scale on 80-core CPUs with unpredictable performance characteristics (because we don't know how they will be designed). Either they'll behave like our current CPUs and your code will perform fine, or they won't and whatever you guessed now will turn out to have been wrong. In most cases the code won't be performance sensitive anyway, so it doesn't matter; but if it is, the appropriate thing to do will be to fix it in the future, when you understand the architectural and performance characteristics of your target processors.

Louis Gerbarg
+2  A: 

I don't think the problem is that atomic operations will themselves take longer; the real problem might be that an atomic operation blocks bus operations on other processors (even ones performing non-atomic operations).

If you want to write code to last, try to avoid locking in the first place.

Martin v. Löwis
If I understand you correctly, you're saying an atomic operation being executed on one processor can cause a slowdown on all other processors? So the ~100 cycle cost of a lock instruction isn't just paid on the current thread or process's CPU, but on all current flows of execution?
Joseph Garvin
IIUC, a lock instruction causes a bus lock, blocking all memory operations on all processors for as long as the lock is asserted (though other processors can still use their local caches).
Martin v. Löwis
That is correct; it's the primary way an atomic `cmpxchg` ensures that the memory won't get updated "under its feet" by another CPU core. It forces a write cycle on the bus even if it's not actually writing.
Blindy
FWIW, not all CPUs implement atomics with a bus lock. Common Intel x86 family CPUs do, but PowerPC (e.g., on the PS3) uses MESI signaling with reserved cache lines rather than a bus lock.
Adisak
+5  A: 

You are right that topology constraints will, one way or another, increase the latency of communication between cores once the counts go higher than a couple dozen. I don't really know what the x86 vendors intend to do about that sort of scaling.

But locks are implemented in terms of atomic operations, so you don't really win by switching to them unless they are implemented in a more scalable way than what you could attempt with your own hand-rolled atomic operations. I think that, generally, for single-token-like contention, atomic primitives will always be the fastest way, regardless of how many cores you have.

As Cray discovered a long time ago, there's no free lunch here. High-level software design, in which you use potentially contended resources as infrequently as possible, will always yield the biggest payoff in massively parallelized applications. This means doing as much work as possible as the result of a lock acquisition, but also as quickly as possible. In extreme situations, this can mean pre-calculating your work on the assumption of a successfully acquired lock, trying to grab it, completing as fast as possible on success, and otherwise throwing away your work and retrying on failure.
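
A minimal sketch of that speculate-then-commit pattern, in its lock-free CAS form (C11; `expensive_transform` is a hypothetical stand-in for the pre-calculated work):

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the expensive pre-calculated work. */
static int expensive_transform(int v) { return v * 2 + 1; }

int optimistic_update(atomic_int *shared) {
    int observed = atomic_load(shared);
    int result;
    do {
        result = expensive_transform(observed);  /* speculative; no lock held */
        /* Commit with a single CAS; it fails if another core changed
         * 'shared' in the meantime, in which case the work is discarded. */
    } while (!atomic_compare_exchange_weak(shared, &observed, result));
    return result;
}
```

The contended step is reduced to a single CAS, so the window in which other cores can be delayed is as short as possible.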

Paul Hsieh
A: 

On a side note to this question, it is worth mentioning that the future you refer to is already present technology in GPUs. A modern Quadro GPU has as many as 256 cores and can perform atomic operations on global (display) memory.
I'm not sure how this is achieved, but the fact is that it's already happening.

shoosh
No, GPUs don't perform atomic operations on the display buffer. In fact, the fundamental difference between a modern GPGPU core and a CPU core is that the GPU core is not cache coherent.
Louis Gerbarg
Not on the display buffer, but CUDA has built-in atomic operations for video memory.
shoosh