I'm writing a program in C++ to perform a simulation of a particular system. For each timestep, the biggest part of the execution is taken up by a single loop. Fortunately this is embarrassingly parallel, so I decided to use Boost Threads to parallelize it (I'm running on a 2-core machine). I would expect a speedup close to 2x over the serial version, since there is no locking. However, I am finding that there is no speedup at all.

I implemented the parallel version of the loop as follows (a rough sketch follows the list):

  • Wake up the two threads (they are blocked on a barrier).
  • Each thread then performs the following:

    • Atomically fetch and increment a global counter.
    • Retrieve the particle with that index.
    • Perform the computation on that particle, storing the result in a separate array.
    • Wait on a job finished barrier.
  • The main thread waits on the job finished barrier.

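Roughly, the structure looks like the sketch below. This is not my real code: Particle, compute(), the sizes and thread count are placeholder names, the counter is shown as a std::atomic, and worker shutdown is omitted for brevity.

    #include <atomic>
    #include <cstddef>
    #include <vector>
    #include <boost/thread.hpp>
    #include <boost/thread/barrier.hpp>

    struct Particle { double state; };
    double compute(const Particle& p) { return p.state * p.state; } // stand-in for the heavy per-particle work

    const unsigned NUM_THREADS = 2;
    std::vector<Particle> particles(10000);
    std::vector<double> results(particles.size());       // results go into a separate array, so no locking
    std::atomic<std::size_t> next_index(0);

    boost::barrier start_barrier(NUM_THREADS + 1);        // workers + main thread
    boost::barrier done_barrier(NUM_THREADS + 1);

    void worker()
    {
        for (;;) {
            start_barrier.wait();                         // block until the main thread starts a timestep
            for (;;) {
                std::size_t i = next_index.fetch_add(1);  // atomically fetch and increment the global counter
                if (i >= particles.size()) break;
                results[i] = compute(particles[i]);       // retrieve the particle and do the computation
            }
            done_barrier.wait();                          // "job finished" barrier
        }
    }

    int main()
    {
        for (unsigned t = 0; t < NUM_THREADS; ++t)
            boost::thread(worker).detach();               // worker shutdown omitted in this sketch

        for (int step = 0; step < 100; ++step) {          // one iteration per timestep
            next_index.store(0);
            start_barrier.wait();                         // wake up the two threads
            done_barrier.wait();                          // main thread waits on the job-finished barrier
        }
    }
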
I used this approach since it should provide good load balancing, given that each computation may take a different amount of time. I am really curious as to what could possibly cause this lack of speedup. I have always read that atomic variables are fast, but now I'm starting to wonder whether they have hidden performance costs.

If anybody has any ideas about what to look for, or any hints, I would really appreciate it. I've been bashing my head on this for a week, and profiling has not revealed much.

EDIT: Problem solved! I'm going to detail how I solved this problem. I used gprof again, but this time compiled without the optimization flag (-O3). Immediately, the profiler indicated that I was spending an incredible amount of time in the function which performs the computation on each individual particle: much more than in the serial version.

This function is virtual and accessed polymorphically. I changed the code to access it directly rather than through the vtable, and voilà, the parallel version produced a speedup of nearly 2x! The same change to the serial version had little effect.

I'm not sure why this is so, and would be interested if anyone knows!
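
For illustration, the change amounts to something like this (Model, MyModel and computeParticle are made-up names, not my actual classes); the qualified call is resolved at compile time instead of going through the vtable:

    struct Particle { double state; };

    struct Model {
        virtual ~Model() {}
        virtual double computeParticle(const Particle& p) const { return p.state; }
    };

    struct MyModel : Model {
        double computeParticle(const Particle& p) const { return p.state * p.state; }
    };

    // Before: dispatched through the vtable on every particle.
    double via_vtable(const Model* m, const Particle& p)
    {
        return m->computeParticle(p);
    }

    // After: a qualified call on the concrete type is resolved at compile time (and is inlinable).
    double direct(const MyModel* m, const Particle& p)
    {
        return m->MyModel::computeParticle(p);
    }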

Thanks to all the posters. You all helped to some extent and it will be very difficult to accept one answer.

+2  A: 

Perform the computation on that particle, storing the result in a separate array

How heavy are the computations?

  • Generally speaking, an atomic counter may cost hundreds of clock cycles, so it is quite important to check that you are not only incrementing counters.
  • Also try to see how much work each thread does - do they cooperate well (i.e. on each cycle, does each process about half of the particles)?
  • Try to subdivide the job into bigger chunks than a single particle (say, 100 particles at a time; see the sketch below).
  • See how much work is done outside of the threads.

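A rough sketch of the chunking idea, reusing the placeholder names from the sketch in the question (CHUNK_SIZE is made up; barriers omitted): grab a block of indices with one atomic operation instead of one index at a time.

    #include <algorithm>   // std::min

    const std::size_t CHUNK_SIZE = 100;

    void worker_chunked()   // drop-in replacement for the inner loop of worker() above
    {
        for (;;) {
            std::size_t begin = next_index.fetch_add(CHUNK_SIZE);      // one atomic op per 100 particles
            if (begin >= particles.size()) break;
            std::size_t end = std::min(begin + CHUNK_SIZE, particles.size());
            for (std::size_t i = begin; i < end; ++i)
                results[i] = compute(particles[i]);
        }
    }
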
Honestly... what you are describing looks like a bug.

Artyom
The computations can be quite heavy, and they vary wildly, from a few milliseconds to close to a second. The threads are independent of each other, and I checked their execution time: it's roughly equal, so load balancing is taking place. I will try your suggestion of considering 100 particles instead of just 1 now.
Il-Bhima
>> from a few milliseconds to close to a second << If so, using 100-particle chunks would not help. What may be happening is that one thread completes quickly and then waits too long at the barrier for the other thread, which is executing the long runs... Measure how much time each thread runs before it gets to the barrier.
Artyom
+1  A: 

profiling has not revealed much

This is unclear. I have experience profiling a multithreaded application on HP-UX, where the profiler reports the percentage of time spent in each function. So if you have one or a few contention points in your functions, you see an increase in the time your application spends in those functions. In my case I saw a significant increase in pthread_mutex_unlock(). When I changed my code, it became much faster.

So could you post the same statistics here for one thread and for two/four threads, along with the number of computations in each test?

Also, if it is possible, I recommend setting a breakpoint on the global function that locks a mutex. You might find that somewhere in your algorithm you inadvertently lock a global mutex.

skwllsp
A: 

You say profiling has not revealed much, and that (sadly) is typical.

Here's what I would do:

  1. Go back to single-thread.

  2. Make that single thread as fast as possible by using this profiling technique that works in any language and environment. The reason is that profilers (most but not all) are only good at measuring changes, rather than pinpointing what you should fix.

  3. Then go back to 1-thread-per-core, and do the process again. If you find that one thread, or the other, is spending much time blocked on interprocess communication, then you need to re-work that.

Mike Dunlavey
1) The single-threaded code has been beaten into shape over the last few months; I would be very surprised if it could be made any faster without actually changing the dynamics of the model I am simulating. 2) I'm a fan of that method, actually; I remember reading your post some time ago :) 3) Sadly, there should be no interprocess communication at all. This is the mystery.
Il-Bhima
@Il-Bhima: I would still be really curious to try it, because it costs nothing, and it has never failed to tell me exactly what is (and is not) going on. It is usually very educational. For example, you say there should be no interprocess communication. You can replace that "should be" with "is" or "is not".
Mike Dunlavey
+1  A: 

Your language is kind of revealing:

Wait on xxx

This might be your problem.


Plus, you get slowed down again when adding to a single result queue - if possible, you might add the results to the single queue only at the end of the processing. The main thread should not wait, but should check the global counter after every update.
Instead of profiling, I would add performance counters which you log at the end. You may put them behind conditional compilation, so that they are not added to your production code.
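
A minimal sketch of such a counter (ScopedTimer and PERF_COUNTERS are made-up names): a scope-based timer that only exists when the flag is defined, so it never reaches the production build.

    #include <chrono>
    #include <cstdio>

    // Prints how long the enclosing scope took; compiled in only when PERF_COUNTERS is defined.
    struct ScopedTimer {
        const char* name;
        std::chrono::steady_clock::time_point start;
        explicit ScopedTimer(const char* n) : name(n), start(std::chrono::steady_clock::now()) {}
        ~ScopedTimer() {
            long long us = std::chrono::duration_cast<std::chrono::microseconds>(
                               std::chrono::steady_clock::now() - start).count();
            std::printf("%s took %lld us\n", name, us);
        }
    };

    void timestep()
    {
    #ifdef PERF_COUNTERS
        ScopedTimer t("particle loop");   // disappears entirely in production builds
    #endif
        // ... the parallel loop ...
    }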

weismat
I also suspect that may be the problem. Thus I replaced the barriers with spin locks, and I still get the same disappointing performance. I have performance counters based on clock() at the end of most functions, which are included when compiling in "DEBUG" mode. They indicate that both threads are doing an approximately equal amount of work.
Il-Bhima
+1  A: 

Can I suggest that you might find OpenMP easier for this kind of parallelism? Since you just want to make the loop parallel, you don't really want to be working with threads explicitly, and this is exactly the kind of thing where OpenMP can be really effective.

Worth a shot anyhow.
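
A rough sketch of what that could look like, reusing the placeholder names from the sketch in the question (needs an OpenMP-enabled build, e.g. -fopenmp with GCC); dynamic scheduling handles the uneven per-particle cost:

    void timestep_openmp()
    {
        // Each thread grabs chunks of 100 iterations at a time; no explicit threads, counters or barriers.
        #pragma omp parallel for schedule(dynamic, 100)
        for (long i = 0; i < (long)particles.size(); ++i)
            results[i] = compute(particles[i]);
    }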

John