I wrote a multi-threaded app to benchmark the speed of running LOCK CMPXCHG (x86 ASM).
On my machine (a dual-core Core 2), with 2 threads running and operating on the same variable, I can perform about 40M ops/second.
Then I gave each thread its own variable to operate on. Since the threads no longer contend for the same variable, I expected a speedup. However, the throughput didn't change. Why?
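For reference, here is a minimal sketch of the kind of benchmark described above. It is not the original program. It assumes C++11 `std::atomic`, whose `compare_exchange_strong` typically compiles to LOCK CMPXCHG on x86, and the names `cas_worker`, `run_benchmark`, and `Result` are made up for illustration. Declaring the two per-thread variables next to each other is also an assumption about how the original code laid them out.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Increment *target `iters` times using a compare-and-swap retry loop.
// On x86 this CAS typically becomes a LOCK CMPXCHG instruction.
void cas_worker(std::atomic<std::int64_t>* target, std::int64_t iters) {
    for (std::int64_t i = 0; i < iters; ++i) {
        std::int64_t expected = target->load(std::memory_order_relaxed);
        // On failure, `expected` is refreshed with the current value, so just retry.
        while (!target->compare_exchange_strong(expected, expected + 1)) {
        }
    }
}

// Hypothetical result type: total successful CAS operations and elapsed time.
struct Result {
    std::int64_t total;
    double seconds;
};

// Run two threads. If `shared` is true both hit the same variable,
// otherwise each thread gets its own variable.
Result run_benchmark(bool shared, std::int64_t iters) {
    assert(iters > 0);

    // Assumption: the two variables are declared side by side.
    std::atomic<std::int64_t> first{0};
    std::atomic<std::int64_t> second{0};

    std::atomic<std::int64_t>* a = &first;
    std::atomic<std::int64_t>* b = shared ? &first : &second;

    auto start = std::chrono::steady_clock::now();
    std::thread t1(cas_worker, a, iters);
    std::thread t2(cas_worker, b, iters);
    t1.join();
    t2.join();
    auto stop = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(stop - start).count();
    return Result{first.load() + second.load(), seconds};
}
```

Calling `run_benchmark(true, n)` and `run_benchmark(false, n)` and dividing `total` by `seconds` gives the ops/second figure for the shared and per-thread cases respectively.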