views:

107

answers:

1

Hello all,

I'm reading Memory Barriers by Paul E. McKenney http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf everything is explained in great details and when I see that everything is clear I encounter one sentence, which stultifies everything and make me think that I understood nothing. Let me show the example

void foo(void)
{
   a = 1; #1
   b = 1; #2
}

void bar(void)
{
   while (b == 0) continue; #3
   assert(a == 1); #4
}

let's say this two functions are running on a different processors. Now what could possibly happen is store to a #1 could be seen after store to b #2 by the second processor, because first processor queues store to "a" and proceed to store b instruction. OK, that's fine, we add a write fence in the line between #1 and #2, but this code still can fail, because second processor might queue the invalidate message, so we add one more memory fence (read fence this time) in the line between #4 and #4.

void foo(void)
{
   a = 1; #1
   write_memory_barrier();
   b = 1; #2
}

void bar(void)
{
   while (b == 0) continue; #3
   read_memory_barrier();
   assert(a == 1); #4
}

this enforce second processor to process all queued messages (invalidate a) and read it again by sending read MESI message to first processor on #4. OK. Next the article says

Many CPU architectures therefore provide weaker memory-barrier instructions that do only one or the other of these two. Roughly speaking, a “read memory barrier” marks only the invalidate queue and a “write memory barrier” marks only the store buffer. while a full-fledged memory barrier does both.

Great, that's clear, but after that I see this

The effect of this is that a read memory barrier orders only loads on the CPU that executes it, so that all loads preceding the read memory barrier will appear to have completed before any load following the read memory barrier. Similarly, a write memory barrier orders only stores, again on the CPU that executes it, and again so that all stores preceding the write memory barrier will appear to have completed before any store following the write memory barrier.

so

all loads preceding the read memory barrier will appear to have completed before any load following the read memory barrier

that mixes up everything what was explained before. What does it mean? Which load in function "bar" have to complete before load of "a" #4? I understand the assert could fail without memory barrier in this function just because the processor may read an old value, because it still didn't manage to invalidate it's cache line, where object "a" is located.

Explanation in details would be really helpful, I'm trying to understand it all the day.

Thanks very much in advance.

+2  A: 

What does it mean?

It means that if you have:

read
read
read
READ BARRIER
read
read
read

then the read barrier acts as a "join point" dividing these reads into two batches. All the reads preceding the read barrier will have been done before any read following the read barrier is begun.

Which loads in bar() must complete before the load of a (#4) is begun?

All reads of b (#3) are forced to precede any read of a (#4). This means that a is not read till after b is no longer 0. Because foo() uses a write barrier to ensure that a has already been changed to 1 (#1) by the time that b is changed (#2). The two barriers thus work together to ensure the assert statement will always succeed.

Jeremy W. Sherman
Thanks for your answer, but if we have a look at example point 4.3 in the article we see an example, when actually all reads of b (#3) precede read of a without memory barrier and assert still fails, because CPU 1 executes the assert(a==1), and, sincethe old value of “a” is still in CPU 1’s cache,this assertion fails.
confucius
In the fixed code (with read barrier) CPU 1 executes the assert(a==1), and, sincethe cache line containing “a” is no longer inCPU 1’s cache (because read barrier forced to invalidate cache line), it transmits a "read" message.
confucius
So it's not just an ordering, right? I don't see how it could be explained with just saying that it forces the ordering. I think there is something that I don't understand, some fundamental detail, that doesn't allow me to join all the information I have and get the final picture.
confucius
I think the issue is that the wrap-up is at a higher level of abstraction. Forget temporarily about caching, and imagine that all reads are actually being carried out for the first time before/after the read barrier. That's the end result of invalidating the cache: the processor is forced to perform the read of `a` after it hits the read barrier. Without the read barrier, it's free to read `a` and then loop till `b` is non-zero. The barrier ensures that the read of `a` is not reordered before the barrier, i.e., before `b` is nonzero.
Jeremy W. Sherman
Does it mean, that this is just said to simplify the understanding of what is going on when you look at code? So this message queue, caching done by processor just results in appearance of memory operations out of order from single processor point of view? Like for example in function "bar" we can imagine, that processor reorders operations and read "a" before it starts reading "b"? So this is said just to do not think every time about caching and low level details of implementation?
confucius
Yes! I think that is exactly what is going on. This is the sense in which these are "barriers" at all - they block reordering from a simplified (uniprocessor) point of view. Otherwise, it would make more sense to call the operations "flush write queue" and "invalidate read cache".
Jeremy W. Sherman
By the way, you might find the article [*x86-TSO: A Rigorous and Usable Programmer's Memory Model for x86 Multiprocessors*](http://moscova.inria.fr/~zappa/readings/cacm10.pdf) by Owens, Sarkar, and Sewell, published in the May 2010 CACM, to be very helpful in thinking about these things.
Jeremy W. Sherman
There's also an [extended version of the `x86-TSO` paper](http://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.pdf) that includes proof outlines and a verified checker. The one published in CACM was really just a "research highlight". That said, it's the version I read, and it was definitely worth the time. Read the version in the actual issue of CACM if you get the chance; the diagrams and tables are much prettier and more readable than they are in the final-draft preprint I linked to.
Jeremy W. Sherman
Thanks! That article looks interesting, I was searching for everything related to barriers, but I didn't find this one. Do you know any way of testing memory reordering? My compiler on x86 doesn't generate any barriers at all, maybe there is way to simulate it on Virtual Machine? How the tests are done if there is no ALPHA processor available?
confucius
The article should describe how they verified their model was accurate on actual x86 hardware. You could introduce barriers into your code by using `asm` statements.
Jeremy W. Sherman
Thank very much.
confucius