views:

81

answers:

3

It's my understanding that if two threads are reading from the same piece of memory, and no thread is writing to that memory, then the operation is safe. However, I'm not sure what happens if one thread is reading and the other is writing. What would happen? Is the result undefined? Or would the read just be stale? If a stale read is not a concern, is it OK to have unsynchronized read-write access to a variable? Or is it possible the data would be corrupted, so that neither the read nor the write is correct, and one should always synchronize in this case?

I want to say that I've learned it is the latter case, that a race on memory access leaves the state undefined... but I don't remember where I may have learned that and I'm having a hard time finding the answer on Google. My intuition is that a variable is operated on in registers, and that true (as in hardware) concurrency is impossible (or is it?), so that the worst that could happen is stale data, i.e. the following:

WriteThread: copy value from memory to register
WriteThread: update value in register
ReadThread:  copy value from memory to register
WriteThread: write new value to memory

At which point the read thread has stale data.

+2  A: 

The result is undefined. Corrupted data is entirely possible. For an obvious example, consider a 64-bit value being manipulated by a 32-bit processor. Let's assume the value is a simple counter, and we increment it when the lower 32-bits contain 0xffffffff. The increment produces 0x00000000. When we detect that, we increment the upper word. If, however, some other thread read the value between the time the lower word was incremented and the upper word was incremented, they get a value with an un-incremented upper word, but the lower word set to 0 -- a value completely different from what it would have been either before or after the increment is complete.
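
For illustration, here is a minimal sketch that emulates that 64-bit counter as two 32-bit words. The race is deliberate (and formally undefined behavior), so treat it as a demonstration of the window between the two partial writes, not production code; it uses C++11 std::thread, which postdates this discussion, purely for brevity:

#include <cstdint>
#include <cstdio>
#include <thread>

volatile std::uint32_t lo = 0xffffffffu;  // lower word, about to wrap
volatile std::uint32_t hi = 0x00000001u;  // upper word

void writer()
{
    lo = lo + 1;  // lower word wraps to 0x00000000
    // a reader running exactly here sees hi == 1, lo == 0, i.e. a value
    // different from both the old (0x1ffffffff) and the new (0x200000000)
    hi = hi + 1;  // carry into the upper word
}

void reader()
{
    std::uint32_t h = hi;
    std::uint32_t l = lo;
    std::printf("observed 0x%08x%08x\n", (unsigned) h, (unsigned) l);
}

int main()
{
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
    return 0;
}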

Jerry Coffin
That makes perfect sense. Thank you very much. Not to nitpick, but for the sake of understanding, let's say that I know for a fact I'm on a 32-bit processor and that my variable is 32 bits (or less). If only one thread is writing, and other threads are reading, will there still be the potential for data corruption? In other words, is the example you provided the *only* problem?
cheshirekow
If your computer offers 32 bit atomic writes, and only one thread writes, then the value seen by the other thread is either the old value or the new value. This is the basis of Dekker's algorithm.
Ira Baxter
If it is *exactly* 32 bits, no. If it's less, perhaps. Also, this guarantee is only for x86; PPC or ARM might not be as charitable.
Paul Betts
Yes, it's still possible, at least on x86, if your 32-bit item is improperly aligned so it crosses a cache line. I think PPC and ARM prevent that (not by keeping the access atomic, but by prohibiting such misaligned accesses altogether).
Jerry Coffin
Unix provides a sig_atomic_t type which is guaranteed to be written atomically. I'm sure Windows has something similar. You can use sizeof(sig_atomic_t) to find out its size on a particular platform.
Paul Rubel
@paulrubel: `sig_atomic_t` is actually provided by the C standard, so it's available on virtually all platforms. Unfortunately, it doesn't mean much -- it's guaranteed to be atomic in the presence of asynchronous interrupts, but *not* necessarily threads. The two may be the same, but then again they may not.
Jerry Coffin
+6  A: 

Usually memory is read or written in atomic units determined by the CPU architecture (32- and 64-bit items aligned on 32- and 64-bit boundaries are common these days).

In this case, what happens depends on the amount of data being written.

Let's consider the case of 32 bit atomic read/write cells.

If two threads write 32 bits into such an aligned cell, then it is absolutely well defined what happens: one of the two written values is retained. Unfortunately for you (well, the program), you don't know which value. By extremely clever programming, you can actually use this atomicity of reads and writes to build synchronization algorithms (e.g., Dekker's algorithm), but it is typically faster to use architecturally defined locks instead.
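
For illustration, here is a rough sketch of Dekker's algorithm for two threads. It uses C++11 std::atomic (sequentially consistent by default, and postdating this discussion) to stand in for the aligned atomic cells described above and to keep the compiler and CPU from reordering the flag operations; plain variables would not be enough on a modern compiler:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<bool> wants_to_enter[2] = {{false}, {false}};
std::atomic<int>  turn{0};
int shared_counter = 0;              // protected by the algorithm below

void worker(int me)
{
    int other = 1 - me;
    for (int i = 0; i < 100000; ++i)
    {
        wants_to_enter[me] = true;
        while (wants_to_enter[other])
        {
            if (turn != me)                  // not our turn: back off
            {
                wants_to_enter[me] = false;
                while (turn != me) { }       // busy-wait for our turn
                wants_to_enter[me] = true;
            }
        }
        ++shared_counter;                    // critical section
        turn = other;
        wants_to_enter[me] = false;
    }
}

int main()
{
    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    std::printf("counter = %d (expected 200000)\n", shared_counter);
    return 0;
}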

If two threads write more than an atomic unit (e.g., they both write a 128-bit value), then the atomic-unit-sized pieces of the values will each be stored in an absolutely well defined way, but you won't know which pieces of which value get written in what order. So what may end up in storage is the value from the first thread, the value from the second thread, or a mix of atomic-unit-sized pieces from both threads.
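
To make that mixing visible, here is a small illustrative sketch (the race is deliberate and formally undefined behavior, so it is a demonstration only): two threads each store a distinctive 128-bit pattern as four 32-bit words, and the main thread counts snapshots that contain words from both patterns:

#include <cstdint>
#include <cstdio>
#include <thread>

volatile std::uint32_t cell[4];      // a "128-bit value" made of 4 words

void writer(std::uint32_t pattern)
{
    for (int i = 0; i < 1000000; ++i)
        for (int w = 0; w < 4; ++w)  // written one atomic unit at a time
            cell[w] = pattern;
}

int main()
{
    std::thread a(writer, 0xAAAAAAAAu);
    std::thread b(writer, 0xBBBBBBBBu);

    int mixed = 0;
    for (int i = 0; i < 1000000; ++i)
    {
        std::uint32_t w[4] = { cell[0], cell[1], cell[2], cell[3] };
        bool sawA = false, sawB = false;
        for (int k = 0; k < 4; ++k)
        {
            if (w[k] == 0xAAAAAAAAu) sawA = true;
            if (w[k] == 0xBBBBBBBBu) sawB = true;
        }
        if (sawA && sawB)
            ++mixed;                 // pieces of both writes in one snapshot
    }
    a.join();
    b.join();
    std::printf("%d snapshots mixed words from both writers\n", mixed);
    return 0;
}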

Similar ideas hold for one thread reading and one thread writing, both for atomic units and for larger values.

Basically, you don't want to do unsynchronized reads and writes to memory locations, because you won't know the outcome, even though it may be very well defined by the architecture.
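
As a contrast, here is a minimal sketch of the synchronized alternative, using C++11 std::atomic (which postdates this discussion): the writer publishes through an atomic 64-bit counter, and the reader only ever sees values the writer actually produced, never a torn mix, even on a 32-bit target:

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<std::uint64_t> counter{0};   // atomic even if the target needs two words

int main()
{
    std::thread writer([] {
        for (int i = 0; i < 1000000; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    });

    std::thread reader([] {
        std::uint64_t last = 0;
        while (last < 1000000)
        {
            last = counter.load(std::memory_order_relaxed);
            // 'last' is always a value the writer actually stored,
            // never a half-updated mix of old and new words.
        }
        std::printf("final observed value: %llu\n", (unsigned long long) last);
    });

    writer.join();
    reader.join();
    return 0;
}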

Ira Baxter
Keep in mind that unless you use "special" ways of writing to the variables (i.e., the `InterlockedXxx` routines), on multicore systems you could end up with different values for the variables because the CPU caches haven't been synchronized.
snemarch
Er, I thought x86 at least offered coherent caches even on multicore systems. I don't know as much about ARM or PPC, but incoherent caches make programming that much harder. (I think the supercomputer-scale guys give up on coherence for the sheer parallelism of distributed cores.)
Ira Baxter
A: 

As I hinted in my comment on Ira Baxter's answer, CPU cache also plays a part on multicore systems. Consider the following test code:

DANGER, WILL ROBINSON!

The following code boosts priority to realtime to achieve somewhat more consistent results; doing so requires admin privileges. Be careful if running the code on dual- or single-core systems, since your machine will lock up for the duration of the test run.

#include <windows.h>
#include <stdio.h>

const int RUNFOR = 5000;            // milliseconds per test run
volatile bool terminating = false;
volatile int value;                 // shared, deliberately unsynchronized

// Each thread keeps storing its own tag into 'value' and immediately reads
// it back; a mismatch means the other thread's write landed in between.
static DWORD WINAPI CountErrors(LPVOID parm)
{
    int errors = 0;
    while(!terminating)
    {
        value = (int)(INT_PTR) parm;
        if(value != (int)(INT_PTR) parm)
            errors++;
    }
    printf("\tThread %08X: %d errors\n", (unsigned)(UINT_PTR) parm, errors);
    return 0;
}

static void RunTest(int affinity1, int affinity2)
{
    terminating = false;
    DWORD dummy;
    HANDLE t1 = CreateThread(0, 0, CountErrors, (void*)0x1000, CREATE_SUSPENDED, &dummy);
    HANDLE t2 = CreateThread(0, 0, CountErrors, (void*)0x2000, CREATE_SUSPENDED, &dummy);

    // Pin each thread to a specific core before letting it run.
    SetThreadAffinityMask(t1, affinity1);
    SetThreadAffinityMask(t2, affinity2);
    ResumeThread(t1);
    ResumeThread(t2);

    printf("Running test for %d milliseconds with affinity %d and %d\n", RUNFOR, affinity1, affinity2);
    Sleep(RUNFOR);
    terminating = true;
    Sleep(100); // let threads have a chance of picking up the "terminating" flag.
    CloseHandle(t1);
    CloseHandle(t2);
}

int main()
{
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    RunTest(1, 2);      // core 1 & 2
    RunTest(1, 4);      // core 1 & 3
    RunTest(4, 8);      // core 3 & 4
    RunTest(1, 8);      // core 1 & 4
}

On my quad-core Intel Q6600 system (which IIRC has two sets of cores where each set shares L2 cache; that would explain the results anyway ;)), I get the following results:

Running test for 5000 milliseconds with affinity 1 and 2
        Thread 00002000: 351883 errors
        Thread 00001000: 343523 errors
Running test for 5000 milliseconds with affinity 1 and 4
        Thread 00001000: 48073 errors
        Thread 00002000: 59813 errors
Running test for 5000 milliseconds with affinity 4 and 8
        Thread 00002000: 337199 errors
        Thread 00001000: 335467 errors
Running test for 5000 milliseconds with affinity 1 and 8
        Thread 00001000: 55736 errors
        Thread 00002000: 72441 errors
snemarch
If this is an x86, what you are demonstrating here is "two threads doing unsynchronized atomic writes", not cache incoherency.
Ira Baxter
@Ira Baxter: isn't that exactly cache incoherency, though, since the instructions don't include a LOCK prefix, and caches thus aren't synchronized?
snemarch
@snemarch: I don't know what you mean by "unsynchronized". x86 CPUs ensure cache coherency via the MESI protocol (go check the x86 architecture manuals); if one CPU attempts to touch a location held in another CPU's cache, the MESI protocol ensures that the cache line is transferred to the requesting CPU, so that all CPUs see a "coherent" view of each memory location. LOCK forces the lock-issuing CPU to hold the cache line until it is done with it, usually only a matter of a read and a write.
Ira Baxter
@Ira Baxter: did a little reading, and it seems I indeed misunderstood the LOCK prefix - my bad. How is the factor ~10 difference between the affinity scenarios explained, though?
snemarch
@snemarch: Which value you get depends on the relative timing of the writes. Changing processor affinity changes that timing relationship. If your CPUs are hyperthreaded, you can get really weird effects on timing because your programs are competing for the same underlying physical processor.
Ira Baxter
@snemarch: your program is still a nice way to demonstrate the problem the OP asked about.
Ira Baxter
@Ira Baxter: the Q6600 isn't hyperthreaded - it's a quad-core with a full set of execution units per core. IIRC the design is basically "two dual-cores on a die", where each dual-core shares a block of L2 cache, plus some fast interconnect. But, let me repeat, it's four physical cores, not HT.
snemarch
And while the timings differ a fair amount per run (you'd need to do a bare-metal "OS" to avoid that), the factor ~10x between the affinity groupings is consistent between runs.
snemarch
@snemarch: as you've noted, you have two dualcores. That's simply not perfectly symmetric, so you shouldn't expect perfectly symmetric results from different processor assignments.
Ira Baxter