I have a serial application that I parallelized using OpenMP. I simply added the following pragma to my main loop:

    #pragma omp parallel for default(shared)
    for (int i = 0; i < numberOfEmitters; ++i)
    {
        computeTrajectoryParams* params = new computeTrajectoryParams;
        // defining params...
        outputs[i] = (int*) ComputeTrajectory(params);

        delete params;
    }

It seems to work well: at the beginning, all my worker threads execute an iteration of the loop, everything goes fast, and I have a 100% CPU load (on a quad-core machine). However, after a while, one of the worker threads stops and stays in a function called _vcomp::PersistentThreadFunc from vcomp90.dll (the file is vctools\openmprt\src\ttpool.cpp), then another one does the same, and so on, until only the main thread is still working.

Does anybody have an idea why this happens? It starts after about half of the iterations have been executed.

+5  A: 

It might depend on the scheduling scheme and on how much computation each iteration needs. If the scheduling is static, the iterations are divided among the threads before the loop runs, so on your machine each thread gets 1/4 of the indexes. Some threads may then finish before others because their iterations are cheaper (or because their cores are less loaded with other things). A thread that has finished its share has nothing left to do and sits idle while the remaining threads keep working.

Try dynamic scheduling and see if it works better.
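Here is a minimal sketch of what I mean. Note that `unevenWork` and `runAll` are made-up names, and `unevenWork` is only a stand-in for your ComputeTrajectory, written under the assumption that later iterations cost more than earlier ones. With `schedule(dynamic)`, a thread that finishes an iteration picks up the next unprocessed index instead of going idle.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for ComputeTrajectory: the cost grows with i,
// so late iterations are much more expensive than early ones.
long long unevenWork(int i)
{
    long long acc = 0;
    for (long long k = 0; k < static_cast<long long>(i) * 1000; ++k)
        acc += k % 7;
    return acc;
}

// Runs every iteration and stores one result per index.
std::vector<long long> runAll(int numberOfEmitters)
{
    assert(numberOfEmitters >= 0);
    std::vector<long long> outputs(numberOfEmitters);

    // schedule(dynamic): iterations are handed out one at a time at run
    // time, so no thread is stuck with only the expensive indexes.
    #pragma omp parallel for default(shared) schedule(dynamic)
    for (int i = 0; i < numberOfEmitters; ++i)
        outputs[i] = unevenWork(i);

    return outputs;
}
```

Each iteration writes only to its own `outputs[i]`, so the results do not depend on which thread ran which index. If the pragma is ignored (OpenMP disabled in the compiler), the loop simply runs serially.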

Anna
Thanks, this was exactly the problem! I changed the pragma to `#pragma omp parallel for default(shared) schedule(dynamic)` and now it works as expected: 100% CPU load the whole time! Thanks again!
Wookai
Wow. Absolutely correct and no upvote. Here, have one from me.
Martin York
I know, I wanted to upvote, but I had used up my quota for the day. I came back today ;)
Wookai
+2  A: 

A small comment on your code: if ComputeTrajectory's execution time is measured in milliseconds and you have more than a few iterations, you should really make sure you use a memory allocator that is optimized for multiprocessing, because you allocate in each iteration and (still today) most allocators have a global pool with a global lock.

You could also look into getting the allocation out of the loop entirely, but there is not enough info to know if it is possible here.
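As a rough sketch of that second idea, assuming the params object can be reused from one iteration to the next (which I can't tell from your code): split the pragma into `parallel` and `for`, and give each thread a single params object on its own stack. The struct fields, `computeTrajectoryStub` and `runWithPerThreadParams` below are all hypothetical placeholders, not your real types.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the asker's parameter type; the real
// fields are not shown in the question.
struct computeTrajectoryParams
{
    int emitterIndex;
    double step;
};

// Hypothetical stand-in for ComputeTrajectory.
int computeTrajectoryStub(const computeTrajectoryParams& p)
{
    return p.emitterIndex * 2;
}

std::vector<int> runWithPerThreadParams(int numberOfEmitters)
{
    assert(numberOfEmitters >= 0);
    std::vector<int> outputs(numberOfEmitters);

    #pragma omp parallel default(shared)
    {
        // Declared inside the parallel region, so each thread has its
        // own copy on its stack: no new/delete inside the loop.
        computeTrajectoryParams params;
        params.step = 0.01; // settings shared by all iterations

        #pragma omp for schedule(dynamic)
        for (int i = 0; i < numberOfEmitters; ++i)
        {
            params.emitterIndex = i; // per-iteration settings
            outputs[i] = computeTrajectoryStub(params);
        }
    }

    return outputs;
}
```

This only helps if the per-iteration allocation is actually a bottleneck, and it only works if nothing keeps a pointer to params after the iteration ends.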

Juice
Thanks for the tip. I actually have only a few of these iterations (typically 256), and each one takes on the order of seconds.
Wookai