Previously I've written some very simple multithreaded code, and I've always been aware that at any time there could be a context switch right in the middle of what I'm doing, so I've always guarded access to the shared variables through a CCriticalSection class that enters the critical section on construction and leaves it on destruction. I know this is fairly aggressive, and I enter and leave critical sections quite frequently and sometimes egregiously (e.g. at the start of a function when I could put the CCriticalSection inside a tighter code block), but my code doesn't crash and it runs fast enough.
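Something like this minimal sketch, in case it matters (the names here are illustrative, not my exact class):

#include <windows.h>

class CCriticalSectionLock
{
public:
    explicit CCriticalSectionLock(CRITICAL_SECTION& cs) : m_cs(cs)
    {
        EnterCriticalSection(&m_cs); // acquire on construction
    }

    ~CCriticalSectionLock()
    {
        LeaveCriticalSection(&m_cs); // release on destruction, even on early return or exception
    }

private:
    CCriticalSectionLock(const CCriticalSectionLock&);            // non-copyable
    CCriticalSectionLock& operator=(const CCriticalSectionLock&); // (pre-C++11 style)

    CRITICAL_SECTION& m_cs;
};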

At work my multithreaded code needs to be tighter, locking/synchronising only at the lowest level needed.

At work I was trying to debug some multithreaded code, and I came across this:

EnterCriticalSection(&m_Crit4);
m_bSomeVariable = true;
LeaveCriticalSection(&m_Crit4);

Now, m_bSomeVariable is a Win32 BOOL (not volatile), which as far as I know is defined to be an int, and on x86 reading and writing these values is a single instruction; and since context switches occur on an instruction boundary, there's no need to synchronise this operation with a critical section.

I did some more research online to see whether this operation needs synchronisation, and I came up with two scenarios where it does:

  1. The CPU implements out-of-order execution, or the second thread is running on a different core and the updated value has not been written into RAM for the other core to see; and
  2. The int is not 4-byte aligned.

I believe number 1 can be solved using the "volatile" keyword. In VS2005 and later the C++ compiler surrounds accesses to such a variable with memory barriers, ensuring that the variable is always completely written to/read from main system memory before it is used.
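As I understand it, the pattern looks something like this (my own sketch, relying on the Microsoft extension rather than standard C++; all names are mine):

#include <windows.h>

volatile bool g_dataReady = false; // volatile write/read gets release/acquire semantics under VS2005+
int g_data = 0;

void Producer()
{
    g_data = 42;        // ordinary write
    g_dataReady = true; // volatile write: release semantics, so g_data is visibly written first
}

void Consumer()
{
    while (!g_dataReady) // volatile read: acquire semantics
        Sleep(0);        // yield while waiting (illustrative)

    int x = g_data;      // sees 42 -- but only because of MS's extended volatile
    (void)x;
}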

Number 2 I cannot verify: I don't know why the byte alignment would make a difference. I don't know the x86 instruction set, but does mov need to be given a 4-byte-aligned address? If not, do you need to use a combination of instructions? That would introduce the problem.

So...

QUESTION 1: Does using the "volatile" keyword (implicitly using memory barriers and hinting to the compiler not to optimise this code) absolve a programmer from the need to synchronise a 4-byte/8-byte variable on x86/x64 between read/write operations?

QUESTION 2: Is there the explicit requirement that the variable be 4-byte/8-byte aligned?

I did some more digging into our code and the variables defined in the class:

class CExample
{

private:

    CRITICAL_SECTION m_Crit1; // Protects variable a
    CRITICAL_SECTION m_Crit2; // Protects variable b
    CRITICAL_SECTION m_Crit3; // Protects variable c
    CRITICAL_SECTION m_Crit4; // Protects variable d

    // ...

};

Now, to me this seems excessive. I thought critical sections synchronised threads within a process, so if you've got one you can enter it and no other thread in that process can execute. There is no need for a critical section for each variable you want to protect; if you're in a critical section then nothing else can interrupt you.

I think the only thing that can change the variables from outside a critical section is if the process shares a memory page with another process (can you do that?) and the other process starts to change the values. Mutexes would also help here: named mutexes are shared across processes, or is it only between processes of the same name?

QUESTION 3: Is my analysis of critical sections correct, and should this code be rewritten to use mutexes? I have had a look at other synchronisation objects (semaphores and spinlocks), are they better suited here?

QUESTION 4: Where are critical sections/mutexes/semaphores/spinlocks best suited? That is, which synchronisation problem should they be applied to. Is there a vast performance penalty for choosing one over the other?

And while we're on it, I read that spinlocks should not be used in a single-core multithreaded environment, only a multi-core multithreaded environment. So, QUESTION 5: Is this wrong, or if not, why is it right?

Thanks in advance for any responses :)

+4  A: 

Don't use volatile. It has virtually nothing to do with thread-safety. See here for the low-down.

The assignment to BOOL doesn't need any synchronisation primitives. It'll work fine without any special effort on your part.

If you want to set the variable and then make sure that another thread sees the new value, you need to establish some kind of communication between the two threads. Just locking immediately before assigning achieves nothing because the other thread might have come and gone before you acquired the lock.

One final word of caution: threading is extremely hard to get right. The most experienced programmers tend to be the least comfortable with using threads, which should set alarm bells ringing for anyone who is inexperienced with their use. I strongly suggest you use some higher-level primitives to implement concurrency in your app. Passing immutable data structures via synchronised queues is one approach that substantially reduces the danger.
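For instance, a synchronised queue can be sketched with Win32 primitives along these lines (illustrative only: unbounded, no shutdown handling, pre-C++11 style):

#include <windows.h>
#include <deque>

template <typename T>
class SyncQueue
{
public:
    SyncQueue()
    {
        InitializeCriticalSection(&m_cs);
        m_items = CreateSemaphore(NULL, 0, MAXLONG, NULL); // counts items available
    }

    ~SyncQueue()
    {
        CloseHandle(m_items);
        DeleteCriticalSection(&m_cs);
    }

    void Push(const T& value)
    {
        EnterCriticalSection(&m_cs);
        m_queue.push_back(value);
        LeaveCriticalSection(&m_cs);
        ReleaseSemaphore(m_items, 1, NULL); // wake one waiting consumer
    }

    T Pop() // blocks until an item is available
    {
        WaitForSingleObject(m_items, INFINITE);
        EnterCriticalSection(&m_cs);
        T value = m_queue.front();
        m_queue.pop_front();
        LeaveCriticalSection(&m_cs);
        return value;
    }

private:
    CRITICAL_SECTION m_cs;
    HANDLE m_items;
    std::deque<T> m_queue;
};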

Marcelo Cantos
I am not sure you are right. See http://msdn.microsoft.com/en-us/library/ms686355(VS.85).aspx - volatile makes all the difference with regards to thread safety in the first example given.
taspeotis
Not necessarily so. There might still be visibility issues (one core not seeing the update made by another.) The locking primitives usually imply a suitable memory barrier which is required in order to make all cores see the changed value.
Dirk
But in VS2005 onwards the volatile keyword adds a memory barrier around the variable when you are using it?
taspeotis
RE: "Just locking immediately before assigning achieves nothing because the other thread might have come and gone before you acquired the lock." - I am aware of the double lock check pattern.
taspeotis
Microsoft has done weird things with volatile. The case in the article is related to two different variables, with no locking in between, which is usually asking for trouble, and, as pointed out in the article, it doesn't even work for VS2003. Trust this kind of shenanigans at your peril.
Marcelo Cantos
Also thank you for your responses so far
taspeotis
The problem with `volatile` is that to be able to play the trick in the article, all variables that form the shared state must be `volatile`. Any datum that is not volatile can be reordered by the compiler (there are no memory barriers). If only one of the variables in the article were volatile, the race condition would still be present. Changing all the state to volatile variables will hit performance in many cases (all operations on the data will imply actual memory accesses, and writes will force other processors to reload the affected cache lines even if they are using a different datum...)
David Rodríguez - dribeas
http://www.drdobbs.com/high-performance-computing/212701484
David Rodríguez - dribeas
@David, thank you for your comment. I did not think of out of order execution for standard variables AROUND variables declared `volatile`. It is an excellent point.
taspeotis
+10  A: 

1) No. Volatile just says re-load the value from memory each time; it is STILL possible for it to be half updated.

Edit: 2) Windows provides some atomic functions. Look up the "Interlocked" functions.
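For example (these are real Win32 APIs; the variables are illustrative):

#include <windows.h>

volatile LONG g_flag = 0;
volatile LONG g_counter = 0;

void Examples()
{
    InterlockedExchange(&g_flag, 1);   // atomic store with a full memory barrier
    InterlockedIncrement(&g_counter);  // atomic read-modify-write

    // Compare-and-swap: set g_flag back to 0 only if it is currently 1.
    LONG wasSet = InterlockedCompareExchange(&g_flag, 0, 1);
    (void)wasSet;
}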

The comments led me to do a bit more reading up. If you read through the Intel System Programming Guide you can see that aligned reads and writes ARE atomic.

8.1.1 Guaranteed Atomic Operations

The Intel486 processor (and newer processors since) guarantees that the following basic memory operations will always be carried out atomically:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Accesses to cacheable memory that are split across bus widths, cache lines, and page boundaries are not guaranteed to be atomic by the Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium M, Pentium 4, Intel Xeon, and P6 family processors provide bus control signals that permit external memory subsystems to make split accesses atomic; however, nonaligned data accesses will seriously impact the performance of the processor and should be avoided.

An x87 instruction or an SSE instruction that accesses data larger than a quadword may be implemented using multiple memory accesses. If such an instruction stores to memory, some of the accesses may complete (writing to memory) while another causes the operation to fault for architectural reasons (e.g. due to a page-table entry that is marked "not present"). In this case, the effects of the completed accesses may be visible to software even though the overall instruction caused a fault. If TLB invalidation has been delayed (see Section 4.10.3.4), such page faults may occur even if all accesses are to the same page.

So basically, yes: if you do an 8-bit read/write from any address, a 16-bit read/write from a 16-bit-aligned address, etc., you ARE getting atomic operations. It's also interesting to note that you can do unaligned memory reads/writes within a cache line on a modern machine. The rules seem quite complex though, so I wouldn't rely on them if I were you. Cheers to the commenters, that's a good learning experience for me :)
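To illustrate question 2: the compiler naturally aligns a LONG anyway (unless packing pragmas or placement into a raw buffer interfere), but you can assert or force it explicitly. __declspec(align) is a Microsoft extension; this is just a sanity-check sketch:

#include <windows.h>
#include <assert.h>

__declspec(align(4)) LONG g_value; // force 4-byte alignment explicitly

void CheckAlignment()
{
    // Natural alignment is what makes the plain read/write atomic.
    assert(((ULONG_PTR)&g_value % sizeof(g_value)) == 0);
}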

3) A critical section will spin trying to acquire its lock a few times and then falls back to locking a mutex. Spin locking can suck CPU power doing nothing, and a mutex can take a while to do its stuff. Critical sections are a good choice if you can't use the interlocked functions.
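You can also tune the spin behaviour yourself with InitializeCriticalSectionAndSpinCount (the spin count of 4000 here is only an illustrative figure):

#include <windows.h>

CRITICAL_SECTION g_crit;

void Init()
{
    // Spin up to 4000 times trying to acquire the lock before
    // falling back to a kernel wait.
    InitializeCriticalSectionAndSpinCount(&g_crit, 4000);
}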

4) There are performance penalties for choosing one over another. It's a pretty big ask to go through the benefits of everything here. The MSDN help has lots of good information on each of these; I suggest reading it.

5) You can use a spin lock in a single-core environment, though it's not usually appropriate, as thread scheduling means that you can't have 2 processors accessing the same data simultaneously. It just isn't possible.

Goz
Thank you for the response. With regards to #2, can you please explain why Microsoft has an example that does not require Interlocked* functions when using the "volatile" keyword with VS2005 onwards: http://msdn.microsoft.com/en-us/library/ms686355(VS.85).aspx. They show that you need the Interlocked* family of functions for VS2003 and earlier. And how is it possible for the value to be half updated? I thought reading/writing from memory was one instruction, and that the CPU could synchronise reads/writes of 4 bytes to system RAM?
taspeotis
32-bit ints _are_ atomic on x86 (and 64-bit ints are atomic on x64). The interlocked functions are used when you want to do more than just read or write.
Marcelo Cantos
So they aren't or they are atomic on x86?
taspeotis
just something to add to Marcelos comment: a single read of or a single write to a 32bit int on an x86 is atomic. Reading the value of an int, changing it and writing it back isn't. If volatile (or memory barriers) are enough protection for your shared variables depends entirely on how you use them. Volatile only prevents reordering of access and caching (on VS2005). It does not make an operation atomic.
Tobias Langner
reads/writes certainly *are* atomic on x86. The interlocked instructions serve different purposes
jalf
@Tobias: I believe that `volatile` only prevents reordering with instructions that affect `volatile` variables, but not with instructions on non-volatile variables.
David Rodríguez - dribeas
So why the downvotes?
Goz
@David: I think you're wrong here but correct me if I did not get it right. From the document linked above: "With Visual Studio 2005, the compiler uses acquire semantics for read operations on volatile variables and release semantics for write operations on volatile variables (when supported by the CPU)." "With acquire semantics, the results of the operation are available before the results of any operation that appears after it in code." Analog for release semantics. The "any operation" means for me that it actually prevents reordering here.
Tobias Langner
@Tobias: First, that is MS implementation, not what `volatile` means. Then, the memory barrier is not full, so in the case of the acquire, it will inhibit the compiler from moving later instructions before the `volatile` read, but it will not inhibit pushing below the read any operation that appears before it. Whether that affects the outcome of your program is a different question. And I insist: this is not what `volatile` means, just how it is implemented in VS2005 and over. I would write preprocessor instructions to fail building with other compilers if you follow that path...
David Rodríguez - dribeas
@david: yes, you are right. It's only VS2005 and you have to make sure that you don't use any other compiler.
Tobias Langner
+2  A: 

Volatile does not imply memory barriers.

It only means that it will be part of the perceived state of the memory model. The implication of this is that the compiler cannot optimize the variable away, nor can it perform operations on the variable only in CPU registers (it will actually load and store to memory).

As there are no memory barriers implied, the compiler can reorder instructions at will. The only guarantee is that the order in which different volatile variables are read/written will be the same as in the code:

void test() 
{
    volatile int a;
    volatile int b;
    int c;

    c = 1;
    a = 5;
    b = 3;
}

With the code above (assuming that c is not optimized away), the update to c can happen before, between, or after the updates to a and b, giving 3 possible outcomes. The updates to a and b are guaranteed to be performed in that order. c can easily be optimized away by any compiler. With enough information, the compiler can even optimize away a and b: if it can be proven that no other threads read the variables and that they are not bound to hardware, they can in fact be removed. Note that the standard does not require a specific behavior, but rather a perceivable state under the as-if rule.

David Rodríguez - dribeas
Thank you for the response. As I mention above, "volatile" protects a variable with memory barriers if you're using VS2005 or later. See http://msdn.microsoft.com/en-us/library/ms686355(VS.85).aspx. I am very curious to see what the standard has to say, however. Do you have a link to the standard? I can only find articles that reference the standard. Also, I am aware that volatile implying memory barriers is only for VS2005+, so can I treat it as a poor man's memory barrier as long as I am only targeting these compilers, or is there a "proper" way to do it on VS2005+ (ignoring portability)?
taspeotis
@taspeotis: The standard has no concept of threading, so it doesn't guarantee memory barriers. The *proper* way to do memory barriers is always to specify a memory barrier. (In GCC, I believe it is __sync_synchronize(), and under Windows it is _ReadWriteBarrier()). Don't rely on `volatile` going that extra mile. It does so in VS2005 and 2008, but not under other compilers, and there's no guarantee it'll do so in future versions of VS either.
jalf
@taspeotis, it's MS-specific feature. In fact, I was very recently bitten by GCC reordering regular var reads/writes after a volatile var write, resulting in a very nasty heisenbug.
Alex B
+6  A: 

1: Volatile in itself is practically useless for multithreading. It guarantees that the read/write will be executed, rather than storing the value in a register, and it guarantees that the read/write won't be reordered with respect to other volatile reads/writes. But it may still be reordered with respect to non-volatile ones, which is basically 99.9% of your code. Microsoft have redefined volatile to also wrap all accesses in memory barriers, but that is not guaranteed to be the case in general. It will just silently break on any compiler which defines volatile as the standard does. (The code will compile and run, it just won't be thread-safe any longer)

Apart from that, reads/writes to integer-sized objects are atomic on x86 as long as the object is well aligned. (You have no guarantee of when the write will occur though. The compiler and CPU may reorder it, so it's atomic, but not thread-safe)

2: Yes, the object has to be aligned for the read/write to be atomic.

3: Not really. Only one thread can execute code inside a given critical section at a time. Other threads can still execute other code. So you can have four variables each protected by a different critical section. If they all shared the same critical section, I'd be unable to manipulate object 1 while you're manipulating object 2, which is inefficient and constrains parallelism more than necessary. If they are protected by different critical sections, we just can't both manipulate the same object simultaneously.
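A sketch of the difference (illustrative names; initialisation and cleanup shown for completeness):

#include <windows.h>

class CPair
{
public:
    CPair()
    {
        InitializeCriticalSection(&m_CritA);
        InitializeCriticalSection(&m_CritB);
        m_a = m_b = 0;
    }

    ~CPair()
    {
        DeleteCriticalSection(&m_CritB);
        DeleteCriticalSection(&m_CritA);
    }

    void SetA(int v)
    {
        EnterCriticalSection(&m_CritA);
        m_a = v;                        // thread 1 can be in here...
        LeaveCriticalSection(&m_CritA);
    }

    void SetB(int v)
    {
        EnterCriticalSection(&m_CritB);
        m_b = v;                        // ...while thread 2 is in here
        LeaveCriticalSection(&m_CritB);
    }

private:
    CRITICAL_SECTION m_CritA; // protects m_a only
    CRITICAL_SECTION m_CritB; // protects m_b only
    int m_a;
    int m_b;
};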

4: Spinlocks are rarely a good idea. They are useful if you expect a thread to have to wait only a very short time before being able to acquire the lock, and you absolutely need minimal latency. A spinlock avoids the OS context switch, which is a relatively slow operation. Instead, the thread just sits in a loop constantly polling a variable. So: higher CPU usage (the core isn't freed up to run another thread while waiting for the spinlock), but the thread will be able to continue as soon as the lock is released.
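For illustration, a naive spinlock built on the interlocked primitives might look like this (a sketch only; a production lock would add back-off):

#include <windows.h>

class SpinLock
{
public:
    SpinLock() : m_lock(0) {}

    void Acquire()
    {
        // Atomically set m_lock to 1 if it was 0; otherwise keep polling.
        while (InterlockedCompareExchange(&m_lock, 1, 0) != 0)
            YieldProcessor(); // CPU pause hint, not an OS-level yield
    }

    void Release()
    {
        InterlockedExchange(&m_lock, 0); // atomic store with a full barrier
    }

private:
    volatile LONG m_lock;
};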

As for the others, the performance characteristics are pretty much the same: just use whichever has the semantics best suited for your needs. Typically critical sections are most convenient for protecting shared variables, and mutexes can be easily used to set a "flag" allowing other threads to proceed.

As for not using spinlocks in a single-core environment, remember that the spinlock doesn't actually yield. Thread A waiting on a spinlock isn't actually put on hold, allowing the OS to schedule thread B to run. But since A is waiting on this spinlock, some other thread is going to have to release that lock. If you only have a single core, then that other thread will only be able to run when A is switched out. With a sane OS, that's going to happen sooner or later anyway as part of the regular context switching. But since we know that A won't be able to get the lock until B has had time to execute and release the lock, we'd be better off if A just yielded immediately, was put in a wait queue by the OS, and restarted when B has released the lock. And that's what all other lock types do. A spinlock will still work in a single-core environment (assuming an OS with preemptive multitasking); it'll just be very, very inefficient.

jalf
A: 

Question 3: CRITICAL_SECTIONs and mutexes work, pretty much, the same way. A Win32 mutex is a kernel object, so it can be shared between processes and waited on with WaitForMultipleObjects, which you can't do with a CRITICAL_SECTION. On the other hand, a CRITICAL_SECTION is lighter-weight and therefore faster. But the logic of the code should be unaffected by which you use.
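A sketch of that kernel-mutex usage (the mutex name is illustrative):

#include <windows.h>

void UseNamedMutex()
{
    // A named mutex can be opened by any process that knows the name.
    HANDLE hMutex = CreateMutex(NULL, FALSE, TEXT("Global\\MyApp_SharedState"));
    if (hMutex == NULL)
        return;

    if (WaitForSingleObject(hMutex, INFINITE) == WAIT_OBJECT_0)
    {
        // ... touch the shared state ...
        ReleaseMutex(hMutex);
    }

    CloseHandle(hMutex);
}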

You also commented that "there is no need for a critical section for each variable you want to protect, if you're in a critical section then nothing else can interrupt you." This is true, but the tradeoff is that accesses to any of the variables would need you to hold that lock. If the variables can meaningfully be updated independently, you are losing an opportunity for parallelising those operations. (Since these are members of the same object, though, I would think hard before concluding that they can really be accessed independently of each other.)

Andy Mortimer
A: 

Q1:

In VS2005 and later the C++ compiler surrounds access to this variable using memory barriers, ensuring that the variable is always completely written/read to the main system memory before using it.

Exactly. If you are not creating portable code, Visual Studio implements it exactly this way. If you want to be portable, your options are currently "limited". Until C++0x there is no portable way to specify atomic operations with guaranteed read/write ordering, and you need to implement per-platform solutions. That said, boost has already done the dirty job for you, and you can use its atomic primitives.

Q2:

If you do keep them aligned, you are safe. If you do not, the rules are complicated (cache lines, ...); therefore the safest way is to keep them aligned, as this is easy to achieve.

Q3:

Critical section is a lightweight mutex. Unless you need to synchronize between processes, use critical sections.

Q4:

Critical sections can even do spin waits for you.

Q5:

A spin lock uses the fact that while the waiting CPU is spinning, another CPU may release the lock. This cannot happen with only one CPU, so there it is nothing but a waste of time. On multi-CPU systems spin locks can be a good idea, but it depends on how often the spin wait will be successful. The idea is that waiting for a short while is a lot faster than doing a context switch there and back again; therefore, if the wait is likely to be short, it is better to wait.

Suma