I think you are mixing together atomic memory access with cache coherence. The former is the required hardware support for building synchronization primitives in software (spin-locks, semaphores, and mutexes), while the latter is the hardware support for multiple chips (several CPUs, and peripheral devices) working over the same bus, and having consistent view of the main memory.
Different compilers/libraries provide different utilities for the first. Here's, for example, GCC intrinsics for atomic memory access. They all boil down to generating either compare-and-swap or load-linked/store-conditional based instruction blocks depending on the platform support. Compile your source with, say, -S
for GCC and see the assembler generated.
You don't have to do anything explicitly for cache coherency - it's all handled in hardware - but it definitely helps to understand how it works to avoid things like cache line ping-pong.
With all that, single word reads and writes are atomic on all commodity platforms (somebody correct me if I'm wrong here). Since int
s are less or equal to processor word in size, you are covered (see the GCC builtins link above).
It's the order of reads and writes that is important. Here's where architecture memory model is important. It dictates what operations can and cannot be re-ordered by the hardware. Example would be updating a linked list - you don't want other CPUs see a new item linked until the item itself is in consistent state. Explicit memory barriers (also often called "memory fences") might be required. Acquire barrier ensures that subsequent operations are not re-ordererd before the barrier (say you read the linked-list item pointer before the content of the item), Release barrier ensures that previous operations are not re-ordered after the barrier (you write the item content before writing the new link pointer).
volatile
is often misunderstood as being related to all the above. In fact it is just an instruction to the compiler not to cache variable value in register, but read it from memory on each access. Many argue that it's "almost useless" for concurrent programming.
Apologies for lengthy reply. Hope this clears it a bit.
Edit:
Upcoming C++0x standard finally addresses concurrency, see Hans Boehm's C++ memory model papers for many details.