I'm clear on the usage of MemoryBarrier, but not on what happens behind the scenes in the runtime. Can anyone give a good explanation of what goes on?

+1  A: 

When writing lock-free concurrent code, you have to account for instruction reordering.

Instruction reordering can occur at several stages:

  1. C#/VB.NET/F# compiler optimizations
  2. JIT compiler optimizations
  3. CPU optimizations.

Memory fences are the only way to enforce a particular order of your program's memory operations. Basically, a memory fence is a class of instruction that causes the CPU to enforce an ordering constraint. Memory fences fall into three categories:

  1. Load fences - ensure no load CPU instructions move across the fences
  2. Store fences - ensure no store CPU instructions move across the fences
  3. Full fences - ensure no load or store CPU instructions move across the fences

In the .NET Framework there are plenty of ways to emit fences: Interlocked, Monitor, ReaderWriterLockSlim, etc.

Thread.MemoryBarrier emits a full fence at both the JIT compiler level and the processor level.
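
To make the ordering guarantee concrete, here is a minimal sketch of one thread publishing a value to another through a flag. The Mailbox type and its member names are hypothetical and exist only for this illustration; the only framework calls used are Thread.MemoryBarrier and Thread.Yield.

```csharp
using System;
using System.Threading;

// Hypothetical example type; not part of the .NET Framework.
public sealed class Mailbox
{
    private int _payload;
    private bool _ready;

    public void Publish(int payload)
    {
        _payload = payload;
        Thread.MemoryBarrier(); // the store to _payload stays above the store to _ready
        _ready = true;
    }

    public bool TryRead(out int payload)
    {
        bool ready = _ready;
        Thread.MemoryBarrier(); // the load of _ready stays above the load of _payload
        payload = ready ? _payload : 0;
        return ready;
    }
}

public static class Program
{
    public static void Main()
    {
        var mailbox = new Mailbox();
        var writer = new Thread(() => mailbox.Publish(42));
        writer.Start();

        int seen;
        while (!mailbox.TryRead(out seen))
        {
            Thread.Yield(); // spin until the writer has published
        }
        writer.Join();
        Console.WriteLine(seen); // prints 42
    }
}
```

Because the barrier in TryRead runs on every call, it also keeps the JIT from hoisting the read of _ready out of the spin loop. In real code a lock or a volatile field would usually be the simpler choice; the explicit barriers are here only to show where the fence sits.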

Vitaliy Liptchinsky
Your last sentence is what I'm looking for. I'm aware of what fences are and why they're needed, but how is it emitted by the JIT compiler, and what's actually being output?
Kilhoffer
+3  A: 

In a really strong memory model, emitting fence instructions would be unnecessary. All memory accesses would execute in order and all stores would be globally visible.

Memory fences are needed because current common architectures do not provide a strong memory model. x86/x64, for example, can reorder a read ahead of an earlier write to a different location. (A more thorough source is "Intel® 64 and IA-32 Architectures Software Developer’s Manual, 8.2.2 Memory Ordering in P6 and More Recent Processor Families".) To pick one example out of many, Dekker's algorithm will fail on x86/x64 without fences.
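
The store-then-load pattern at the heart of Dekker's algorithm can be sketched as follows. This is an illustrative test harness, not Dekker's full algorithm, and all names in it (StoreLoadDemo, CountBothZero, the fields) are made up for the example. Each thread writes its own flag and then reads the other thread's flag; the full fence between the two keeps the read from being performed ahead of the write.

```csharp
using System;
using System.Threading;

// Hypothetical harness for the store -> load case; names are illustrative.
public static class StoreLoadDemo
{
    private static int _flagA, _flagB;
    private static int _seenByA, _seenByB;

    private static void RunA()
    {
        _flagA = 1;
        Thread.MemoryBarrier(); // the load below may not pass the store above
        _seenByA = _flagB;
    }

    private static void RunB()
    {
        _flagB = 1;
        Thread.MemoryBarrier(); // same fence on the other side
        _seenByB = _flagA;
    }

    // Counts the rounds in which both threads read 0 from the other's flag.
    public static int CountBothZero(int rounds)
    {
        int bothZero = 0;
        for (int i = 0; i < rounds; i++)
        {
            _flagA = 0; _flagB = 0;
            _seenByA = -1; _seenByB = -1;

            var a = new Thread(RunA);
            var b = new Thread(RunB);
            a.Start(); b.Start();
            a.Join(); b.Join();

            if (_seenByA == 0 && _seenByB == 0) bothZero++;
        }
        return bothZero;
    }

    public static void Main()
    {
        Console.WriteLine(CountBothZero(1000)); // prints 0 with the fences in place
    }
}
```

With both fences present, at least one thread must observe the other's flag, so the count stays at 0. If the two Thread.MemoryBarrier calls are removed, a round in which both threads read 0 becomes legal on x86/x64, although this simple harness may rarely or never catch it, since thread start-up cost makes the two bodies overlap infrequently.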

Even if the JIT produces machine code in which memory loads and stores are carefully placed, its efforts are useless if the CPU then reorders these loads and stores.

Risking oversimplification: it may help to visualize the instruction stream with loads and stores as a thundering herd of wild animals. As they cross a narrow bridge (your CPU), you can never be sure about the order of the animals, since some of them will be slower, some faster, some overtake, some fall behind. If at the start - when you emit the machine code - you partition them into groups by putting infinitely long fences between them, you can at least be sure that group A comes before group B.

Fences ensure the ordering of reads and writes. Wording is not exact, but:

  • a store fence "waits" for all outstanding store (write) operations to finish, but does not affect loads.
  • a load fence "waits" for all outstanding load (read) operations to finish, but does not affect stores.
  • a full fence "waits" for all store and load operations to finish. The effect is that reads and writes before the fence are executed before the reads and writes on the "other side of the fence" (those that come later than the fence).

What the JIT emits for a full fence depends on the (CPU) architecture and the memory ordering guarantees it provides. Since the JIT knows exactly what architecture it runs on, it can issue the proper instruction(s).

On my x64 machine, with .NET 4.0 RC, it happens to be a lock or instruction:

            int a = 0;
00000000  sub         rsp,28h 
            Thread.MemoryBarrier();
00000004  lock or     dword ptr [rsp],0 
            Console.WriteLine(a);
00000009  mov         ecx,1 
0000000e  call        FFFFFFFFEFB45AB0 
00000013  nop 
00000014  add         rsp,28h 
00000018  ret 
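
To look at the generated code on your own machine, a small program along the following lines can be run under a debugger with a disassembly view. This is a sketch: the Program and Fenced names are placeholders, and the NoInlining attribute is only there so the method keeps its own body in the JIT output. The instruction you see may differ from the listing above depending on the runtime version and CPU architecture.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Threading;

public static class Program
{
    // NoInlining keeps this method as a separate unit in the JIT output,
    // which makes the fence easier to find in a disassembly view.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Fenced()
    {
        int a = 0;
        Thread.MemoryBarrier(); // look for the instruction emitted for this call
        Console.WriteLine(a);
    }

    public static void Main()
    {
        Fenced(); // prints 0
    }
}
```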

Intel® 64 and IA-32 Architectures Software Developer’s Manual Chapter 8.1.2:

  • "...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor."

  • Memory-ordering instructions address this specific need. MFENCE could have been used as a full barrier in the above case (at least in theory: for one, locked operations might be faster; for another, MFENCE might result in different behavior). MFENCE and its friends can be found in Chapter 8.2.5 "Strengthening or Weakening the Memory-Ordering Model".

There are some more ways to serialize stores and loads, though they are either impractical or slower than the above methods:

  • In chapter 8.3 you can find full serializing instructions like CPUID. These serialize instruction flow as well: "Nothing can pass a serializing instruction and a serializing instruction cannot pass any other instruction (read, write, instruction fetch, or I/O)".

  • If you set up memory as strong uncacheable (UC), it will give you a strong memory model: no speculative or out-of-order accesses will be allowed and all accesses will appear on the bus, so there is no need to emit an instruction. :) Of course, this will be a tad slower than usual.

...

So it depends on the architecture. If there were a computer with strong ordering guarantees, the JIT would probably emit nothing.

IA64 and other architectures have their own memory models - and thus guarantees of memory ordering (or lack of them) - and their own instructions/ways to deal with memory store/load ordering.

andras
Excellent explanation. Good links to other resources as well. Thank you!
Kilhoffer