I have a large C++ program that modifies the FPU control word (using _controlfp()). It unmasks some FPU exceptions and installs a SEHTranslator to produce typed C++ exceptions. I am using VC++ 9.0.

I would like to use OpenMP (v.2.0) to parallelize some of our computational loops. I've already successfully applied it to one, but the numerical results are slightly different (though I understand it could also be due to calculations being performed in a different order). I'm assuming this is because the FPU state is thread-specific. Is there some way to have the OpenMP threads inherit that state from the master thread? Or is there some way to specify using OpenMP that new threads execute a particular function that sets up the correct state? What is the idiomatic way to deal with this situation?

A: 

The likelihood is that this has to do with the ordering of the floating point operations. We all rely on our operations being associative and commutative, but the unfortunate truth is that floating point addition is not associative, so when the work is parallelized the results may vary because the order of the operations changes.

Try running your loops backwards and seeing if the result differs.

If you do have per-thread needs, OpenMP's default static schedule gives you guarantees about which iterations of a loop fall on the same thread: if your loop runs from 1 to N on a quad core, iterations 1 to N/4 will all be run on the same thread.


Rick
+1  A: 
  1. As you pointed out already, float/double operations are not associative, commutative, and distributive the way real-number arithmetic is. In particular, multiplying or dividing by very large or very small numbers can produce noticeable precision errors when you change the order of computation.

  2. FPU state should be thread-specific, since the state is held in registers and register state (i.e., the context) is per-thread.

  3. It is ambiguous to say that spawned threads "inherit" the master thread's state, because "state" is not well defined in this context. If you mean the register state, then no, they do not.

  4. My suggestion: why not simply set the FPU control word in each thread? For example, before spawning the OpenMP threads, i.e., before the parallel for, store the current FPU control word in a global variable by calling _controlfp(0, 0) (which queries the control word without modifying it; _status87 returns the status word, which is not what you want here). Then, inside each iteration, read that global and set the control word. Since the global variable is only ever read inside the loop, there is no data race to worry about.

unsigned int saved_cw = _controlfp(0, 0);   // query the current control word

#pragma omp parallel for
for (int i = 0; i < N; ++i)
{
  _controlfp(saved_cw, _MCW_EM);  // restore the exception-mask bits in this thread

  // ... the computation ...
}
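A refinement of the idea above: the control word only needs to be set once per thread, not once per iteration, so you can split the combined parallel-for into a parallel region plus a worksharing for and restore the word on entry to the region. A Windows/MSVC-specific sketch (the function and the doubled-data computation are placeholders of mine):

```cpp
#include <float.h>   // _controlfp, _MCW_EM (MSVC CRT)
#include <omp.h>

// Restore the master thread's exception-mask bits once in each worker
// thread, rather than on every iteration.
void parallel_compute(double* data, int n)
{
    unsigned int saved_cw = _controlfp(0, 0);  // read-only query

    #pragma omp parallel
    {
        _controlfp(saved_cw, _MCW_EM);  // once per thread, before the loop
        #pragma omp for
        for (int i = 0; i < n; ++i)
        {
            data[i] = data[i] * 2.0;    // placeholder for the real computation
        }
    }
}
```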
minjang
A: 

I've concluded that I do not have a problem. The differences in results are due to the order of calculations, not to the FPU state in different threads (we are not changing precision or rounding modes). As for FPU exception masking being different in the worker threads, that is not a concern because if a worker thread performs an operation that would result in an exception, that result (now NaN or Inf, etc.) will eventually "factor in" to the main thread and the exception will be thrown.

In addition, an exception thrown inside a parallel region must be caught in the same OpenMP thread that threw it, before the region ends. This means I only want the master thread to be able to throw exceptions anyhow.
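To make that restriction concrete: a C++ exception must not be allowed to escape the parallel region, so each iteration catches its own exceptions and reports failure some other way. A sketch of the containment pattern (the flag-based reporting is my own convention, not part of OpenMP):

```cpp
#include <stdexcept>

// Each iteration catches its own exceptions; a shared flag carries the
// failure back to the master thread after the region ends.
bool run_loop(int n)
{
    bool failed = false;

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        try {
            if (i == 3)  // stand-in for real work that may throw
                throw std::runtime_error("bad iteration");
        } catch (const std::exception&) {
            #pragma omp critical
            failed = true;  // exception contained within the throwing thread
        }
    }
    return failed;  // master thread decides what to do about it
}
```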