I recently stumbled upon this Wikipedia article. From my experience with multi-threading I am aware of the multitude of issues caused by the scheduler being able to switch between threads at any time. However, I never knew that compiler and hardware optimisations could reorder operations in a way that is guaranteed to work for a single thread, but not necessarily for multi-threading. Can anyone explain how to correctly deal with the possibility of reordered operations in a multi-threaded environment?

UPDATE: I had originally linked to the Out-of-Order Execution article by accident; the Memory barrier article has a better explanation of the problem.

+1  A: 

How do you prevent the possibility of out-of-order execution occurring and blowing up in your face?

You don't - the compiler can only change the order of execution when doing so doesn't alter the end result.

Anon.
That's not strictly true; on Windows and at least two other major gaming platforms the C/C++ compilers provide memory ordering intrinsics that prevent compiler reordering. (As well as CPU reordering.)
MSN
They are only necessary when the order of instructions matters as well as the end result (for example, in locking algorithms).
Anon.
The compiler may make any optimisation that is valid for a single thread, but those optimisations can fail once multiple threads are involved. http://en.wikipedia.org/wiki/Memory_barrier
Casebash
This is true given the assumptions the compiler is making. In most cases, the compiler assumes single-threaded operation except for things that you explicitly mark as being visible to concurrent operations, e.g., volatile in C/C++.
Jamey Hicks
A: 

You should not be running things that need to happen in order in different threads. Threads are for processing things in parallel, so if the order is important, it needs to be done serially.

Nate Heinrich
It's very rare for processes to be entirely parallel. Usually there needs to be some communication, if only to gather the results of computation.
Pete Kirkham
It's OK to have an ordering dependence in multiple threads, as long as you synchronize to enforce the ordering. Producer-consumer parallelism is a common pattern but requires synchronization between the producer and consumer.
Jamey Hicks
A: 

Most compilers nowadays have explicit ordering intrinsics. C++0x has memory ordering intrinsics as well.
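
For illustration, a minimal sketch of what the C++0x ordering facilities look like (names as later standardized in C++11's <atomic> header; data and ready are made-up variables):

    #include <atomic>

    std::atomic<bool> ready(false);
    int data = 0; // plain data, published via the atomic flag

    void producer() {
        data = 42;
        // release store: the write to data cannot move after this
        ready.store(true, std::memory_order_release);
    }

    void consumer() {
        // acquire load: reads after this cannot move before it
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        // data is guaranteed to be 42 here
    }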

MSN
Explicit ordering intrinsics?
Casebash
For C++0x: http://www.devx.com/SpecialReports/Article/38883/1954
MSN
+2  A: 

The compiler does not generate out-of-order execution errors; it optimizes and reorders however it likes, as long as what it produces yields the result your source code says it should.

But when dealing with multithreading, this can indeed blow up, though that generally has little to do with how the compiler has reordered your code (though it could make things worse in otherwise optimistic cases).

When dealing with threads operating on the same data, you need to be very, very careful and make sure the data is protected with the appropriate guards (semaphores/mutexes/atomic operations and the like).
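
As a minimal sketch of what "properly guarded" means here (using C++0x's std::mutex for illustration; counter and counter_mutex are made-up names):

    #include <mutex>

    int counter = 0;           // shared between threads
    std::mutex counter_mutex;  // guards counter

    void increment() {
        std::lock_guard<std::mutex> guard(counter_mutex);
        ++counter; // the read-modify-write happens entirely inside the lock
    }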

nos
Could you elaborate on the circumstances when you need guards to prevent this problem?
Casebash
Basically any time you have more than 1 thread accessing the same data.
nos
+1  A: 

The fact of the matter is that if you're only just starting to deal with multithreaded code (to the point that you're explicitly talking about thread scheduling as if it's somewhat scary [not to say it isn't, but for different reasons]), this is happening at a much, much lower level than you need to worry about. As others have said, compilers will not do this if they cannot guarantee correctness, and while it's good to know that technologies like this exist, unless you're writing your own compiler or doing really bare-metal stuff, it shouldn't present an issue.

Marc Bollinger
+2  A: 

Let's be clear - out-of-order execution refers to the processor execution pipeline, not to the compiler per se, as your link clearly demonstrates.
Out-of-order execution is a strategy employed by most modern CPU pipelines that allows them to re-order instructions on the fly, typically to minimise read/write stalls - the most common bottleneck on modern hardware due to the disparity between CPU execution speed and memory latency (i.e. how fast the processor can fetch and process data compared to how fast it can write the result back to RAM).
So this is primarily a hardware feature, not a compiler feature.
You can override this feature if you know what you're doing, typically by using memory barriers. PowerPC has a wonderfully named instruction called eieio (enforce in-order execution of I/O) that forces the CPU to flush all pending reads and writes to memory - this is particularly important in concurrent programming (whether multi-threaded or multi-processor) as it ensures that all CPUs or threads see a synchronised view of all memory locations.
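To illustrate, a hedged sketch in GCC-flavoured C/C++ using the real __sync_synchronize() builtin, which emits a full hardware fence (a sync/eieio-class instruction on PowerPC); payload and flag are made-up names:

    int payload = 0;
    int flag = 0;

    void publish(int value) {
        payload = value;
        __sync_synchronize(); // full barrier: payload is globally visible
        flag = 1;             // ...before any other CPU can see flag == 1
    }

Note that a reader would need a matching barrier between reading flag and reading payload.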
If you want to read about this in depth, then this PDF is an excellent (though detailed) introduction.
HTH

zebrabox
Thanks, the PDF does look pretty interesting on the hardware level, but it doesn't seem to answer the question of how to deal with the problem.
Casebash
+3  A: 

It's not the compiler, it's the CPU. (Well, both actually, but the CPU is the harder to control.) Regardless of how your code gets compiled, the CPU will look ahead in the instruction stream and execute things out of order. Typically, for example, it will start a read early, since memory is slower than the CPU (i.e. start it early and hope the data has arrived before you actually need it).

Both the CPU and the compiler optimize based on the same rule: reorder anything as long as it doesn't affect the results of the program, *assuming a single-threaded, single-processor environment*.

So there's the problem - everything optimizes for single-threadedness, even when the program isn't single-threaded. Why? Because otherwise everything would be 100x slower. Really. And most of your code is effectively single-threaded (i.e. has no cross-thread interaction) - only small parts need to interact in a multi-threaded way.

The best/easiest/safest way to control this is with locks - mutexes, semaphores, events, etc.

Only if you really, really need to optimize (based on careful measurement) should you look into memory barriers and atomic operations - these are the underlying instructions used to build mutexes etc., and, when used correctly, they limit out-of-order execution.
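
For instance, an atomic counter sketched with GCC's __sync_fetch_and_add builtin (a real builtin, shown only to illustrate the kind of primitive that locks are built from):

    int counter = 0; // shared between threads

    void increment() {
        // atomic read-modify-write; the __sync builtins also act
        // as full memory barriers
        __sync_fetch_and_add(&counter, 1);
    }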

But before doing that kind of optimization, check that the algorithms and code-flow are correct and whether you could further minimize multi-threaded interactions.

tony
I'm not worrying too much about performance ATM, just how to write code that correctly deals with these optimisations. I kind of just stumbled onto the memory barrier article.
Casebash
Then just use locks. Any variable that is shared between threads must be written AND read within a lock.
tony
I suppose a lock would prevent the compiler from rearranging the operations.
Casebash
Yes, locks prevent compiler reordering as well - typically not through any magic, but simply because any call to an opaque function (such as lock/unlock) forces the compiler to reload globals and forgo other optimizations, since it doesn't know what that opaque function might change.
tony
Unless you run -O3 where gcc will break everything not marked with `volatile` ;)
Earlz
+7  A: 

Let me ask a question: given some program code (say, a single-threaded application), what is the correct execution? Intuitively, the CPU executing the code in order, exactly as written, would be correct. This illusion of sequential execution is what programmers rely on.

However, modern CPUs don't obey this restriction. As long as no dependences are violated (data dependences, control dependences, and memory dependences), CPUs execute instructions in an out-of-order fashion. But this is completely hidden from programmers; they can never see what is going on inside the CPU.

Compilers exploit this fact as well. As long as the program's semantics (i.e., the inherent dependences in your code) are preserved, compilers will reorder any instruction they can to achieve better performance. One notable optimization is code hoisting: compilers may hoist load instructions to hide memory latency. But don't worry: compilers guarantee correctness; they will NOT crash your program through such instruction reordering, since they must at least preserve dependences. (But compilers might have bugs :-)
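
For a concrete (made-up) example of why this single-thread guarantee is not enough once threads are involved - a compiler may hoist a load out of a loop, which is harmless in one thread but breaks a naive spin-wait:

    int done = 0; // set to 1 by another thread

    void wait_for_done() {
        // A legal single-threaded optimization is to hoist the load of
        // done out of the loop, effectively compiling this as:
        //     if (!done) for (;;) { }
        // Another thread's write of done = 1 may then never be noticed.
        while (!done) { /* spin */ }
    }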

If you're only considering single-threaded applications, you do not need to worry about such out-of-order execution, whether by the compiler or by the CPU.

(To learn more, take a look at the concept of ILP (instruction-level parallelism). Single-threaded performance mostly depends on how much ILP can be extracted from a single thread, so both CPUs and compilers do whatever they can for better performance.)

However, when you consider multithreaded execution, there is a potential problem called the memory consistency problem. Intuitively, programmers expect sequential consistency, but modern multi-core architectures perform dirty and aggressive optimizations (e.g., caches and write buffers), and sequential consistency is hard to implement with low overhead on modern hardware. This can lead to very confusing situations caused by out-of-order execution of memory loads and stores: you may observe loads and stores that were executed out of order. Read some articles on relaxed memory models, such as Intel x86's (see Chapter 8, Memory Ordering, of Volume 3A of the Intel 64 and IA-32 Architectures Software Developer's Manual). Memory barriers are needed in such situations: they enforce the ordering of memory instructions where correctness requires it.
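
A classic illustration of such relaxed ordering is the "store buffering" litmus test, which is observable even on x86 (x, y, r1, and r2 are made-up shared variables):

    int x = 0, y = 0;
    int r1, r2;

    void thread1() { x = 1; r1 = y; }
    void thread2() { y = 1; r2 = x; }

    // Under sequential consistency at least one of r1, r2 must end up
    // as 1. On real hardware both can be 0: each store waits in a store
    // buffer while the subsequent load executes first.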

THE ANSWER TO THE QUESTION: There is no short answer. There are no good tools that detect such out-of-order and problematic behaviors caused by the memory consistency model (though there are research papers), so it is hard even to discover that such bugs exist in your code. However, I strongly suggest you read articles on double-checked locking and the detailed paper on it. In double-checked locking, relaxed memory consistency and compiler reordering (note that compilers are not aware of multi-threaded behavior unless you tell them explicitly, e.g. with memory barriers) can lead to misbehavior.
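
As an illustration, a sketch of the classic *broken* double-checked locking pattern (Singleton, instance, and instance_mutex are made-up names):

    #include <mutex>

    class Singleton { };

    Singleton* instance = 0;
    std::mutex instance_mutex;

    Singleton* get_instance() {
        if (instance == 0) {                  // first check, no lock
            std::lock_guard<std::mutex> guard(instance_mutex);
            if (instance == 0) {              // second check, locked
                // BROKEN: the compiler/CPU may publish the pointer
                // before the constructor finishes, so another thread
                // can see a non-null but not-yet-constructed object.
                instance = new Singleton();
            }
        }
        return instance;
    }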

In sum:

  • If you're only working on a single-threaded program, then you don't need to worry about out-of-order behaviors.
  • On multi-core, you may need to consider memory consistency problems. But it's actually rare that you really need to worry about a memory consistency issue; mostly it's data races, deadlocks, and atomicity violations that kill your multi-threaded program.
minjang
You don't ever actually answer the question; you're just elaborating on what the OP already stated in his question.
RBarryYoung
@RBarryYoung, I elaborated on some principles and also answered it: in a single thread you don't need to worry, but with multiple threads you *may* have to, due to relaxed memory consistency.
minjang
I agree that it doesn't really answer the question, but it contains useful information, so +1
Casebash
@Casebash, thanks. So, I added some answers :D Honestly speaking, I can't give a simple answer to such a big question. I wrote this long posting (not an answer!) to give some background that is critical to understanding the problem. I don't think there is a simple answer to your question; it really depends on the situation.
minjang
+1  A: 

However, I never knew that compiler and hardware optimisations could reorder operations in a way that is guaranteed to work for a single thread, but not necessarily for multi-threading.

As neither C nor C++ has had a strongly defined memory model, compilers could in principle apply reorderings that cause issues for multi-threading. But compilers which are designed for use in multi-threaded environments don't.

Multi-threaded code either writes to memory and uses a fence to ensure visibility of the writes between threads, or it uses atomic operations.
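
A sketch of the atomic-operation pattern, using GCC's real __sync_bool_compare_and_swap builtin (owner and try_claim are made-up names):

    int owner = 0; // 0 = free, otherwise the id of the holding thread

    // Atomically claim the resource if it is free. Any values fed into
    // the CAS must have been computed before it executes, which is why
    // single-thread reordering cannot break this.
    int try_claim(int my_id) {
        return __sync_bool_compare_and_swap(&owner, 0, my_id);
    }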

Since the values used in the atomic operation case are observable in a single thread, the reordering does not affect it - they have to have been calculated correctly prior to the atomic operation.

Compilers intended for multi-threaded applications do not reorder across memory fences.

So the reordering either does not affect the behaviour, or is suppressed as a special case.

If you are already writing correct multi-threaded code, the compiler reordering doesn't matter. It's only an issue if the compiler isn't aware of memory fences, in which case you probably shouldn't be using it to write multi-threaded code in the first place.

Pete Kirkham
In many cases, the same compilers are used in single-threaded and multi-threaded programs and, if you are not careful, they may re-order operations in a way that is noticeable in multithreaded programs. In C, the volatile qualifier is used to denote a memory location that may be visible to other threads; C compilers will not reorder volatile memory reads and writes.
Jamey Hicks
@Jamey Hicks you seem to have missed the point. A given compiler is either targeted at a multithreaded *environment* or it is not. Whether you happen to write multithreaded programs with said compiler doesn't matter. Please read the C specification for the semantics of volatile - it does not mention threads. 'If you are not careful' implies you are not using the memory barriers and atomic operations correctly. The point I was making is that if you are careful to avoid the large reorderings you get between threads, then it already avoids the weaker reorderings the compiler can create.
Pete Kirkham
I think the whole point of this question is whether and how one writes thread-safe code. If the first sentence were rewritten to say that a compiler will not reorder memory operations in thread-safe code in a way that changes the behavior of the program, I would be in complete agreement. As written, it sounds to me like you are implying it won't do so even if the code is not thread-safe.
Jamey Hicks
The OP says he knows about the issues writing multi-threaded code, but is worried about the compiler/CPU reordering the instructions within each thread. Compilers which target multithreaded environments will not introduce any reordering within a thread that breaks code which already handles the possible reorderings between threads using fences, and the memory fences also enforce the order at the CPU level. So if he knows how to overcome the issues of multithreaded code and has checked the compiler docs, there is nothing more that needs doing.
Pete Kirkham
+1  A: 

The compiler and the CPU both implement algorithms which ensure that sequential semantics are preserved for a given execution stream. For them not to implement said algorithms would qualify as a bug. It is safe to assume that instruction reordering will not affect your program's semantics.

As noted elsewhere, memory is the only place where non-sequential semantics may arise; sequential behaviour can be restored there via various well-known mechanisms (at the assembly level, there are atomic memory-access instructions; higher-level constructs such as mutexes, barriers, spinlocks, etc. are implemented with atomic assembly instructions).
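
For instance, a spinlock sketch built on the C++0x std::atomic_flag primitive, showing how a higher-level lock rests on a single atomic instruction:

    #include <atomic>

    std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

    void spin_lock() {
        // test_and_set compiles to an atomic instruction; acquire
        // ordering keeps the critical section from floating above it
        while (lock_flag.test_and_set(std::memory_order_acquire)) { }
    }

    void spin_unlock() {
        lock_flag.clear(std::memory_order_release);
    }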

In answer to your title: You don't handle OOO execution.

Paul Nathan
+1  A: 

I will address your question as one about multithreading in a high-level language, rather than discussing CPU pipeline optimization.

Can anyone explain how to correctly deal with the possibility of reordered operations in a multi-threaded environment?

Most, if not all, modern high-level multithreaded languages provide constructs for managing this potential for the compiler to reorder the logical execution of instructions. In C#, these include field-level constructs (the volatile modifier), block-level constructs (the lock keyword), and imperative constructs (Threading.MemoryBarrier).

Applying volatile to a field causes all access to that field in the CPU/memory to be executed in the same relative order in which it occurs in the instruction sequence (source code).

Using lock around a block of code causes the enclosed instruction sequence to be executed in the same relative order in which it occurs in the parent block of code.

The Threading.MemoryBarrier method indicates to the compiler that the CPU must not reorder memory access around this point in the instruction sequence. This enables a more advanced technique for specialized requirements.

The techniques above are described in order of increasing complexity and performance. As with all concurrency programming, determining when and where to apply these techniques is the challenge. When synchronizing access to a single field, the volatile keyword will work, but it could prove to be overkill. Sometimes you only need to synchronize writes (in which case a ReaderWriterLockSlim would accomplish the same thing with much better performance). Sometimes you need to manipulate the field multiple times in quick succession, or you must check a field and conditionally manipulate it. In these cases, the lock keyword is a better idea. Sometimes you have multiple threads manipulating shared state in a very loosely-synchronized model to improve performance (not typically recommended). In that case, carefully placed memory barriers can prevent stale and inconsistent data from being used in threads.

gWiz
+1  A: 

So essentially you're asking about the memory consistency model. Some languages/environments, such as Java and .NET, define a memory model, and it's the programmer's responsibility not to do things which are not allowed or which result in undefined behavior. If you're unsure about the atomicity of "normal" operations, it's better to be safe than sorry and just use the mutex primitives.

For C and C++ the situation is not as nice, as those language standards don't define a memory model. And no, contrary to the unfortunately popular opinion, volatile doesn't guarantee anything wrt atomicity. In this case, you have to rely on the platform threads library (which among other things executes required memory barriers) or compiler/hw-specific atomic intrinsics, and hope that the compiler doesn't do any optimizations that break program semantics. As long as you avoid conditional locking within a function (or translation unit if using IPA) you ought to be relatively safe.
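
For example, a sketch of relying on the platform threads library (POSIX pthreads shown here; the lock/unlock calls imply the needed memory barriers; shared_value and value_mutex are made-up names):

    #include <pthread.h>

    int shared_value = 0;
    pthread_mutex_t value_mutex = PTHREAD_MUTEX_INITIALIZER;

    void set_value(int v) {
        pthread_mutex_lock(&value_mutex);   // implies a memory barrier
        shared_value = v;
        pthread_mutex_unlock(&value_mutex); // implies a memory barrier
    }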

Luckily, C++0x and the next C standard are rectifying this issue by defining a memory model. I asked a question related to this (and, as it turned out, to conditional locking) here; the question contains links to some documents that go into the issue in some detail. I recommend reading those documents.

janneb