I have a question about the following code sample (taken from: http://www.albahari.com/threading/part4.aspx#_NonBlockingSynch)

class Foo
{
    int _answer;
    bool _complete;

    void A()
    {
        _answer = 123;
        Thread.MemoryBarrier();    // Barrier 1
        _complete = true;
        Thread.MemoryBarrier();    // Barrier 2
    }

    void B()
    {
        Thread.MemoryBarrier();    // Barrier 3
        if (_complete)
        {
            Thread.MemoryBarrier(); // Barrier 4
            Console.WriteLine (_answer);
        }
    }
}

This is followed by this explanation:

"Barriers 1 and 4 prevent this example from writing “0”. Barriers 2 and 3 provide a freshness guarantee: they ensure that if B ran after A, reading _complete would evaluate to true."

I understand how using the memory barriers affects instruction reordering, but what is this "freshness guarantee" that is mentioned?

Later in the article, the following example is also used:

static void Main()
{
    bool complete = false; 
    var t = new Thread (() =>
    {
        bool toggle = false;
        while (!complete) 
        {
           toggle = !toggle;
           // adding a call to Thread.MemoryBarrier() here fixes the problem
        }

    });

    t.Start();
    Thread.Sleep (1000);
    complete = true;
    t.Join();  // Blocks indefinitely
}

This example is followed by this explanation:

"This program never terminates because the complete variable is cached in a CPU register. Inserting a call to Thread.MemoryBarrier inside the while-loop (or locking around reading complete) fixes the error."

So again ... what happens here?

A: 

Calling Thread.MemoryBarrier() immediately refreshes the register caches with the actual values for variables.

In the first example, the "freshness" of _complete is provided by calling the method right after setting the variable and right before using it. In the second example, the initial false value of complete will be cached in the thread's own space and needs to be resynchronized in order for the running thread to immediately see the actual "outside" value.

Tamás Mezei
+1  A: 

Code sample #1:

Each processor core contains a cache with a copy of a portion of memory. It may take a bit of time for the cache to be updated. The memory barriers guarantee that the caches are synchronized with main memory. For example, if you didn't have barriers 2 and 3 here, consider this situation:

Processor 1 runs A(). It writes the new value of _complete to its cache (but not necessarily to main memory yet).

Processor 2 runs B(). It reads the value of _complete. If this value was previously in its cache, it may not be fresh (i.e., not synchronized with main memory), so it would not get the updated value.

Code sample #2:

Normally, variables are stored in memory. However, suppose a value is read multiple times in a single function: As an optimization, the compiler may decide to read it into a CPU register once, and then access the register each time it is needed. This is much faster, but prevents the function from detecting changes to the variable from another thread.

The memory barrier here forces the function to re-read the variable value from memory.
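To make the register-caching effect concrete, here is my own translation of the article's second sample to Java (not code from the article). In Java the analogous fix is the volatile keyword: marking the flag volatile forbids the compiler from hoisting the read into a register, so every iteration re-reads it from memory and the loop terminates:

```java
public class FlagDemo {
    // 'volatile' forbids the JIT from caching 'complete' in a register;
    // without it, the loop below may spin forever, just like the C# sample.
    static volatile boolean complete = false;

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {
            boolean toggle = false;
            while (!complete) {
                toggle = !toggle;   // busy-work, as in the original sample
            }
        });
        t.start();
        Thread.sleep(100);
        complete = true;   // this write is now guaranteed to become visible
        t.join();          // returns promptly instead of blocking indefinitely
        System.out.println("terminated");
    }
}
```

With volatile removed, the same program can hang for exactly the reason described above: the read of complete is hoisted out of the loop.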

interjay
It is NOT the cache(s) that cause the problem. It is the reordering of reads and writes on the memory bus queue. This could happen without caches at all. Code that says "x = 10; y = 20; z = 30;" gets queued onto the memory bus. If the memory bus decides that it is more efficient to do z first, then it reorders the requests. Same with reads. Thinking about it as cache incoherency probably gives similar results, but that's not really what's going on.
tony
+4  A: 

In the first case, Barrier 1 ensures _answer is written BEFORE _complete. Regardless of how the code is written, or how the compiler or CLR instructs the CPU, the memory bus read/write queues can reorder the requests. The Barrier basically says "flush the queue before continuing". Similarly, Barrier 4 makes sure _answer is read AFTER _complete. Otherwise CPU2 could reorder things and see an old _answer with a "new" _complete.

Barriers 2 and 3 are, in some sense, useless. Note that the explanation contains the word "after": i.e. "... if B ran after A, ...". What does it mean for B to run after A? If B and A are on the same CPU, then sure, B can run after A. But in that case, being on the same CPU means there are no memory barrier problems.

So consider B and A running on different CPUs. Now, much like Einstein's relativity, the concept of comparing times at different locations/CPUs doesn't really make sense. Another way of thinking about it - can you write code that can tell whether B ran after A? If so, you probably used memory barriers to do it. Otherwise, you can't tell, and it doesn't make sense to ask. It's also similar to Heisenberg's uncertainty principle - if you can observe it, you've modified the experiment.

But leaving physics aside, let's say you could open the hood of your machine and see that the actual memory location of _complete was true (because A had run). Now run B. Without Barrier 3, CPU2 might STILL NOT see _complete as true - i.e. not "fresh".

But you probably can't open your machine and look at _complete. Nor communicate your findings to B on CPU2. Your only communication is what the CPUs themselves are doing. So if they can't determine BEFORE/AFTER without barriers, asking "what happens to B if it runs after A, without barriers" makes no sense.

By the way, I'm not sure what you have available in C#, but what is typically done - and what is really needed for Code sample #1 - is a single release barrier on the write, and a single acquire barrier on the read:

void A()
{
   _answer = 123;
   WriteWithReleaseBarrier(_complete, true);  // "publish" values
}

void B()
{
   if (ReadWithAcquire(_complete))  // subscribe
   {  
      Console.WriteLine (_answer);
   }
}

The word "subscribe" isn't often used to describe the situation, but "publish" is. I suggest you read Herb Sutter's articles on threading.

This puts the barriers in exactly the right places.
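As an aside (my own sketch, not the article's code): in Java this exact pairing comes for free from the volatile keyword - a volatile write has release semantics and a volatile read has acquire semantics - so the publish/subscribe pattern above can be written as:

```java
class Publisher {
    private int answer;
    private volatile boolean complete;   // volatile supplies both barriers

    void a() {
        answer = 123;      // ordinary write...
        complete = true;   // ...published by the volatile (release) write
    }

    Integer b() {
        if (complete) {    // the volatile (acquire) read "subscribes"
            return answer; // guaranteed to see 123, never 0
        }
        return null;       // publication not observed yet
    }
}
```

This is the minimal form: one release on the writer side, one acquire on the reader side, with no full fences anywhere.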

For Code sample #2, this isn't really a memory barrier problem; it is a compiler optimization issue - the compiler is keeping complete in a register. A memory barrier would force it out, as would volatile - but so, probably, would calling an external function: if the compiler can't tell whether that external function modified complete, it will re-read it from memory. I.e. you could pass the address of complete to some function (defined somewhere where the compiler can't examine its details):

while (!complete)
{
   some_external_function(&complete);
}

Even if the function doesn't modify complete, if the compiler isn't sure, it will need to reload its registers.

I.e. the difference between code sample #1 and code sample #2 is that sample #1 only has problems when A and B are running on separate threads, whereas sample #2 could have problems even on a single-processor machine.

Actually, the other question would be - can the compiler completely remove the while loop? If it thinks complete is unreachable by other code, why not? ie if it decided to move complete into a register, it might as well remove the loop completely.

EDIT: To answer the comment from opc (my answer is too big for comment block):

Barrier 3 forces the CPU to flush any pending read (and write) requests.

So imagine if there was some other reads before reading _complete:

void B()
{
   int x = a * b + c * d;     // read a, b, c, d
   Thread.MemoryBarrier();    // Barrier 3
   if (_complete)
   ...

Without the barrier, the CPU might have all of these 5 read requests 'pending':

a,b,c,d,_complete

Without the barrier, the processor could reorder these requests to optimize memory access (ie if _complete and 'a' were on the same cache line or something).

With the barrier, the CPU gets a, b, c, d back from memory BEFORE _complete is even put in as a request - ensuring that 'b' (for example) is read BEFORE _complete, i.e. no reordering.

The question is - what difference does it make?

If a,b,c,d are independent from _complete, then it doesn't matter. All the barrier does is SLOW THINGS DOWN. So yeah, _complete is read later. So the data is fresher. Putting a sleep(100) or some busy-wait for-loop in there before the read would make it 'fresher' as well! :-)

So the point is - keep it relative. Does the data need to be read/written BEFORE/AFTER relative to some other data or not? That's the question.

And not to put down the author of the article - he does mention "if B ran after A ...". It just isn't exactly clear whether he imagines that B running after A is crucial to the code, observable by the code, or just inconsequential.

tony
so memory barrier #3 actually forces the processor to get the value from main memory instead of a register/cache?
A: 

The "freshness" guarantee simply means that Barriers 2 and 3 force the value of _complete to become visible as soon as possible, as opposed to whenever it happens to be written to memory.

It's actually unnecessary from a consistency point of view, since Barriers 1 and 4 already ensure that _answer will be read after _complete.

MSN