I have a small C++ program using OpenMP. It works fine on Windows 7 on a Core i7 with Visual Studio 2010. On an iMac with a Core i7 and g++ v4.2.1, the code runs much more slowly with 4 threads than it does with just one. The same slowdown shows up on 2 other Red Hat machines using g++. Here is the code:

   int iHundredMillion = 100000000;
   int iNumWorkers = 4;
   std::vector<Worker*> workers;

   for(int i=0; i<iNumWorkers; ++i)
   {
      Worker * pWorker = new Worker();
      workers.push_back(pWorker);
   }

   int iThr;

   #pragma omp parallel for  private (iThr)     // Parallel run
   for(int k=0; k<iNumWorkers; ++k)
   {
      iThr = omp_get_thread_num();
      workers[k]->Run( (3)*iHundredMillion, iThr );
   }

I'm compiling with g++ like this:

g++ -fopenmp -O2 -o a.out *.cpp

Can anyone tell me what silly mistake I'm making on the *nix platform?

A: 

It's impossible to answer given the information provided, but one guess could be that your code is designed so it can't be executed efficiently on multiple threads.

I haven't worked a lot with OpenMP, but I believe an implementation is allowed to use fewer worker threads than specified. In that case, some implementations could be clever enough to realize that the code can't be efficiently parallelized and just run it on a single thread, while others naively try to run it on 4 cores and suffer the performance penalty (due to false, or real, sharing, for example).
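To illustrate what false sharing means here, this is a minimal sketch of one common way to avoid it, not a diagnosis of the code in the question. The names `PaddedCounter` and `count_in_parallel` are hypothetical, and the 64-byte cache-line size is an assumption that holds on many x86 CPUs but is not universal.

```cpp
#include <vector>

// Hypothetical per-worker state. alignas(64) makes sizeof(PaddedCounter) 64,
// so neighbouring counters in an array are at least 64 bytes apart
// (64 is an assumed cache-line size).
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Iteration k only ever writes counters[k], so no two threads write
// to memory that lands on the same (assumed 64-byte) cache line.
long count_in_parallel(int num_workers, long iterations) {
    std::vector<PaddedCounter> counters(num_workers);

    #pragma omp parallel for
    for (int k = 0; k < num_workers; ++k) {
        long local = 0;                 // accumulate in a thread-local variable
        for (long i = 0; i < iterations; ++i) {
            local += 1;
        }
        counters[k].value = local;      // one write to this worker's own slot
    }

    // Sequential reduction after the parallel region has finished.
    long total = 0;
    for (const PaddedCounter& c : counters) {
        total += c.value;
    }
    return total;
}
```

Without `-fopenmp` the pragma is ignored and the loop simply runs sequentially, so the result is the same either way. If the workers in the question write to neighbouring members of a shared array or object, padding like this (or accumulating in locals) is one thing to try.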

Some of the information that'd be necessary in order to give you a reasonable answer is:

  • the actual timings (how long does the code take to run on a single thread? How long with 4 threads using OpenMP? How long with 4 threads using "regular" threads?)
  • the data layout: which data is allocated where, and when is it accessed?
  • what actually happens inside the loop? All we can see at the moment is a multiplication and a function call. As long as we don't know what happens inside the function, you might as well have posted this code: foo(42) and asked why it doesn't return the expected result.
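For the timing question, something along these lines would give comparable numbers. It is only a sketch: `busy_work` is a hypothetical stand-in for `Worker::Run`, and `Timing` and `time_tasks` are made-up names. It measures wall-clock time with `std::chrono::steady_clock`, because `std::clock` reports processor time, which is not the same thing once several threads are running.

```cpp
#include <chrono>
#include <vector>

// Hypothetical CPU-bound stand-in for Worker::Run; it touches no shared data.
unsigned long busy_work(unsigned long iterations) {
    unsigned long acc = 0;
    for (unsigned long i = 0; i < iterations; ++i) {
        acc += i % 7;
    }
    return acc;
}

// Elapsed wall-clock seconds plus a checksum, so the work cannot be optimized away.
struct Timing {
    double seconds;
    unsigned long checksum;
};

// Runs num_tasks independent tasks on up to num_threads OpenMP threads.
Timing time_tasks(int num_tasks, int num_threads, unsigned long iterations) {
    std::vector<unsigned long> results(num_tasks, 0);
    auto start = std::chrono::steady_clock::now();

    #pragma omp parallel for num_threads(num_threads)
    for (int k = 0; k < num_tasks; ++k) {
        results[k] = busy_work(iterations);   // each task writes only its own slot
    }

    auto stop = std::chrono::steady_clock::now();

    unsigned long checksum = 0;
    for (unsigned long r : results) {
        checksum += r;
    }
    return Timing{std::chrono::duration<double>(stop - start).count(), checksum};
}
```

Comparing `time_tasks(4, 1, n).seconds` against `time_tasks(4, 4, n).seconds` on each platform, built with `-fopenmp`, would show whether the slowdown comes from the parallel region itself or from what happens inside `Run`.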
jalf
Ok, thanks for your answer. The fact that the code worked well with 4 threads when compiled with Visual Studio 2010 (it ran in one-third the time) made me think the code was ok, and that maybe I was building it incorrectly with g++.
MrTurtle
It's impossible to say until you post your code. I'm just guessing, and I might be completely wrong.
jalf
+1  A: 

I'm thinking that the g++ compiler is not optimizing as well as the Visual Studio compiler. Can you try other optimization levels (like -O3) and see if it makes a difference?

Or you could try some other compiler. Intel offers free compilers for Linux for non-commercial purposes.

http://software.intel.com/en-us/articles/non-commercial-software-development/

agg