In the first case, Barrier 1 ensures _answer is written BEFORE _complete. Regardless of how the code is written, or how the compiler or CLR instructs the CPU, the memory bus read/write queues can reorder the requests. The Barrier basically says "flush the queue before continuing". Similarly, Barrier 4 makes sure _answer is read AFTER _complete. Otherwise CPU2 could reorder things and see an old _answer with a "new" _complete.
Barriers 2 and 3 are, in some sense, useless. Note that the explanation contains the word "after": ie "... if B ran after A, ...". What does it mean for B to run after A? If B and A are on the same CPU, then sure, B can be after. But in that case, same CPU means no memory barrier problems.
So consider B and A running on different CPUs. Now, very much like Einstein's relativity, the concept of comparing times at different locations/CPUs doesn't really make sense.
Another way of thinking about it - can you write code that can tell whether B ran after A? If so, well you probably used memory barriers to do that. Otherwise, you can't tell, and it doesn't make sense to ask. It's also similar to Heisenberg's Uncertainty Principle - if you can observe it, you've modified the experiment.
But leaving physics aside, let's say you could open the hood of your machine and see that the actual memory location of _complete was true (because A had run). Now run B. Without Barrier 3, CPU2 might STILL NOT see _complete as true. ie not "fresh".
But you probably can't open your machine and look at _complete. Nor communicate your findings to B on CPU2. Your only communication is what the CPUs themselves are doing. So if they can't determine BEFORE/AFTER without barriers, asking "what happens to B if it runs after A, without barriers" makes no sense.
By the way, I'm not sure what you have available in C#, but what is typically done, and what is really needed for Code sample #1, is a single release barrier on the write and a single acquire barrier on the read:
void A()
{
    _answer = 123;
    WriteWithReleaseBarrier(_complete, true);  // "publish" values
}

void B()
{
    if (ReadWithAcquire(_complete))  // subscribe
    {
        Console.WriteLine(_answer);
    }
}
The word "subscribe" isn't often used to describe the situation, but "publish" is. I suggest you read Herb Sutter's articles on threading.
This puts the barriers in exactly the right places.
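C# isn't my strong point, but the same release/acquire pairing can be sketched in C++ with std::atomic (an illustrative translation, not the article's code - the variable names mirror the sample above):

```cpp
#include <atomic>
#include <cassert>

int answer = 0;                      // plain data being published
std::atomic<bool> complete{false};   // the flag that carries the ordering

void A()
{
    answer = 123;                                      // plain write
    complete.store(true, std::memory_order_release);   // "publish": answer is written BEFORE complete
}

void B()
{
    if (complete.load(std::memory_order_acquire))      // "subscribe": complete is read BEFORE answer
        assert(answer == 123);                         // acquire/release guarantees we see the published value
}
```

The release store and acquire load are exactly the two barriers that matter: everything written before the release is visible to anyone who observes the flag via the acquire.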
For Code sample #2, this isn't really a memory barrier problem, it is a compiler optimization issue - it is keeping complete in a register. A memory barrier would force it out, as would volatile, but probably so would calling an external function - if the compiler can't tell whether that external function modified complete or not, it will re-read it from memory. ie maybe pass the address of complete to some function (defined somewhere where the compiler can't examine its details):
while (!complete)
{
    some_external_function(&complete);
}
even if the function doesn't modify complete, if the compiler isn't sure, it will need to reload its registers.
ie the difference between Code 1 and Code 2 is that Code 1 only has problems when A and B are running on separate threads. Code 2 could have problems even on a single-threaded machine.
Actually, the other question would be - can the compiler completely remove the while loop? If it thinks complete is unreachable by other code, why not? ie if it decided to move complete into a register, it might as well remove the loop completely.
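Here's a C++ sketch of the fix (illustrative names, not the article's code): with a plain bool, the compiler may legally hoist the load out of the loop and spin forever; making the flag std::atomic forces a fresh read on every iteration, so neither register caching nor loop removal is allowed.

```cpp
#include <atomic>
#include <thread>

std::atomic<bool> complete{false};   // atomic: loads can't be cached in a register or hoisted

void waiter()
{
    // If 'complete' were a plain bool, the compiler could keep it in a
    // register and turn this into an infinite loop (or delete the loop's
    // re-check entirely). An atomic load must actually re-read memory.
    while (!complete.load(std::memory_order_acquire))
        std::this_thread::yield();
}

void setter()
{
    complete.store(true, std::memory_order_release);
}
```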
EDIT: To answer the comment from opc (my answer is too big for comment block):
Barrier 3 forces the CPU to flush any pending read (and write) requests.
So imagine if there was some other reads before reading _complete:
void B()
{
    int x = a * b + c * d;    // read a,b,c,d
    Thread.MemoryBarrier();   // Barrier 3
    if (_complete)
        ...
Without the barrier, the CPU might have all of these 5 read requests 'pending':
a,b,c,d,_complete
Without the barrier, the processor could reorder these requests to optimize memory access (ie if _complete and 'a' were on the same cache line or something).
With the barrier, the CPU gets a,b,c,d back from memory BEFORE _complete is even put in as a request. ENSURING 'b' (for example) is read BEFORE _complete - ie no reordering.
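Mirroring that snippet in C++ (again an illustrative translation - a full fence between the arithmetic reads and the flag read pins their order, just like Barrier 3):

```cpp
#include <atomic>

int a = 1, b = 2, c = 3, d = 4;
std::atomic<bool> flag{false};   // stands in for _complete

int consumer()
{
    int x = a * b + c * d;                               // reads a,b,c,d
    std::atomic_thread_fence(std::memory_order_seq_cst); // "Barrier 3": the reads above must complete first
    if (flag.load(std::memory_order_relaxed))
        return x;
    return -1;
}
```

As argued above, if a,b,c,d have no relationship to the flag, the fence buys you nothing except the slowdown.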
The question is - what difference does it make?
If a,b,c,d are independent from _complete, then it doesn't matter. All the barrier does is SLOW THINGS DOWN. So yeah, _complete is read later. So the data is fresher. Putting a sleep(100) or some busy-wait for-loop in there before the read would make it 'fresher' as well! :-)
So the point is - keep it relative. Does the data need to be read/written BEFORE/AFTER relative to some other data or not? That's the question.
And not to put down the author of the article - he does mention "if B ran after A...". It just isn't exactly clear whether he is imagining that B after A is crucial to the code, observable by the code, or just inconsequential.