I have here what I understand to be a relatively simple OpenMP construct. The issue is that the program runs about 100-300x faster with 1 thread than with 2 threads. 87% of the program's time is spent in gomp_send_wait() and another 9.5% in gomp_send_post().

The program gives correct results, but I wonder if there is a flaw in the code that is causing some resource conflict, or if it is simply that the overhead of thread creation is drastically not worth it for a loop of chunk size 4. p ranges from 17 to 1000, depending on the size of the molecule we're simulating.

My numbers are for the worst case, when p is 17 and the chunk size 4. The performance is the same whether I'm using static, dynamic, or guided scheduling. With p=150 and chunk size 75, the program is still 75x-100x slower than serial.

...
    double e_t_sum = 0.0;
    double e_in_sum = 0.0;

    int nthreads, tid;

    /* Energy of each particle i against particle c; the partial sums
       are combined across threads by the OpenMP reduction. */
    #pragma omp parallel for schedule(static, 4) reduction(+ : e_t_sum, e_in_sum) shared(ee_t) private(tid, i, d_x, d_y, d_z, rr) firstprivate(V_in, t_x, t_y, t_z) lastprivate(nthreads)
    for (i = 0; i < p; i++){
        if (i != c){
            nthreads = omp_get_num_threads();
            tid = omp_get_thread_num();

            d_x = V_in[i].x - t_x;
            d_y = V_in[i].y - t_y;
            d_z = V_in[i].z - t_z;

            rr = d_x * d_x + d_y * d_y + d_z * d_z;

            if (i < c){
                /* Keep the smaller index first: pair (i, c). */
                ee_t[i][c] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
                e_t_sum += ee_t[i][c];
                e_in_sum += ee_in[i][c];
            }
            else{
                /* Keep the smaller index first: pair (c, i). */
                ee_t[c][i] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
                e_t_sum += ee_t[c][i];
                e_in_sum += ee_in[c][i];
            }

            // if(pid==0){printf("e_t_sum[%d]: %f\n", tid, e_t_sum[tid]);}
        }
    } //end parallel for

    e_t += e_t_sum;
    e_t -= e_in_sum;

...
+1  A: 

I believe you should try moving all those branches (i.e., the ifs) out of the loop and splitting it into two separate loops, one for i < c and one for i > c. This would greatly benefit even the single-threaded code, but it should also give you more parallelism, even if, as you said, the thread-creation overhead may be larger than the benefit for small n.
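Roughly like this (just a sketch; it reuses the declarations, V_in, ee_t, ee_in, and energy() from your posted code):

    /* i < c: no branch needed inside the loop body. */
    #pragma omp parallel for schedule(static) reduction(+ : e_t_sum, e_in_sum) private(d_x, d_y, d_z, rr)
    for (i = 0; i < c; i++){
        d_x = V_in[i].x - t_x;
        d_y = V_in[i].y - t_y;
        d_z = V_in[i].z - t_z;
        rr = d_x * d_x + d_y * d_y + d_z * d_z;
        ee_t[i][c] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
        e_t_sum += ee_t[i][c];
        e_in_sum += ee_in[i][c];
    }

    /* i > c: likewise branch-free. */
    #pragma omp parallel for schedule(static) reduction(+ : e_t_sum, e_in_sum) private(d_x, d_y, d_z, rr)
    for (i = c + 1; i < p; i++){
        d_x = V_in[i].x - t_x;
        d_y = V_in[i].y - t_y;
        d_z = V_in[i].z - t_z;
        rr = d_x * d_x + d_y * d_y + d_z * d_z;
        ee_t[c][i] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
        e_t_sum += ee_t[c][i];
        e_in_sum += ee_in[c][i];
    }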

Metiu
Thanks for this rec. I think you are absolutely right that it will improve the code to get these two tests out of the inner loop by making two loops. I'll definitely be doing that today. However, the boss wants to see an OpenMP version, and will not be placated with the mere removal of some inner-loop tests. :)
+1  A: 

Metiu is right. You can't expect good performance from a loop that has conditional statements in it. This is just bad coding. Even for scalar performance.

Your boss needs to understand that OpenMP and parallelization in general are not magic. To get good performance out of a parallelized code requires that the basic code be optimized for scalar performance first.

The tests do not have to be removed. The loop needs to be restructured. And scalar performance will benefit also.

In my original post, I made a relative comparison between the code's performance with one thread versus multiple threads. Since "bad" is relative, I think it's fair to say that the threaded performance is bad relative to the serial version. I do believe the examples of code improvements you have cited are orthogonal to the issue of the OpenMP performance, but if you care to explain why that view is mistaken, I'd be happy to learn something new. Perhaps you can explain what you mean by "the loop needs to be restructured": in what way?
+1  A: 

You seem to think that it is a given that if you run a serial code in multithreaded mode it has to perform better. That is not a given, and it's often not true. Parallelizing a loop to run in multiple threads or on multiple processors does not always result in better performance. In most cases, some restructuring is needed. In your case the code isn't even good serial code.

Any book on serial code optimization has as rule number 1 for loops: remove all conditional operations. Tests cost. Some compilers (by the way, you never say what OS/compiler/processor you're using; it does matter) can try to optimize over conditional code. Some compilers (like Sun's C compiler) even let you run the program in "collect" mode, where it generates runtime profile information about how often the branches of a conditional are taken, and then let you re-compile in a mode that uses that collected data to optimize the generated code. (See the -xprofile option.)

The first rule for optimizing parallel code is first do the best serial optimization you can. Then parallelize the loops.

By moving the conditionals outside the loop and, as Metiu suggests, rewriting the code as two separate loops, you give the optimizer a better source to work with. The serial code runs better, and the parallelized code is embarrassingly parallel.

Still, results may vary depending on OS/compiler/platform.

See Using OpenMP and Solaris Application Programming

Again, you are harping on an admittedly bad, but irrelevant, issue with the code. As it turns out, I moved the tests out of the loop immediately after Metiu's post. Naturally, it is 15% faster serial, and naturally, it has the exact same relative performance issue using OpenMP. Where did I assert that all parallel code runs faster? I specifically asked in this case because the performance was just so much worse, and in the past I've seen that happen due to a miscode. Now that the dead horsie is sufficiently beaten, perhaps someone else will come in here and take a stab at my question.
A: 

This looks like an issue with the GNU compiler's openmp implementation. Try a different compiler. Intel has a Linux compiler which you should be able to download a copy of and try here.

Another thing I noticed is that the firstprivate variables you have appear to be quite unnecessary. There might be significant overhead in making private copies of the array V_in, which might be your problem.

I'd say your problem is one of those two issues.

+1  A: 

First, try pumping the chunk size larger still. Thread creation carries overhead, picking up new work for each thread does too, and the grain-size needs to be large enough to overwhelm that.

One bigger possibility: GOMP's implementation of the reduction may be very poor (suggested by your profile data), and it generates messages after each block, rather than accumulating within each thread and then collecting them at the end. Try allocating e_t_sum and e_in_sum as arrays with nthreads elements each, and adding to e_t_sum[tid] within the loop, then looping over them to compute the global sum after the parallel loop is finished.
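A sketch of that approach (it reuses the variables and energy() from the original post; the heap-allocated partial-sum arrays and the use of omp_get_max_threads() are my additions, and <stdlib.h>/<omp.h> are assumed to be included):

    int nthreads = omp_get_max_threads();
    double *e_t_partial  = calloc(nthreads, sizeof(double));   /* zero-filled */
    double *e_in_partial = calloc(nthreads, sizeof(double));

    #pragma omp parallel for schedule(static, 4) private(tid, d_x, d_y, d_z, rr)
    for (i = 0; i < p; i++){
        if (i != c){
            tid = omp_get_thread_num();
            d_x = V_in[i].x - t_x;
            d_y = V_in[i].y - t_y;
            d_z = V_in[i].z - t_z;
            rr = d_x * d_x + d_y * d_y + d_z * d_z;
            if (i < c){
                ee_t[i][c] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
                e_t_partial[tid]  += ee_t[i][c];
                e_in_partial[tid] += ee_in[i][c];
            }
            else{
                ee_t[c][i] = energy(rr, V_in[i].q, V_in[c].q, V_in[i].s, V_in[c].s);
                e_t_partial[tid]  += ee_t[c][i];
                e_in_partial[tid] += ee_in[c][i];
            }
        }
    }

    /* Collect the per-thread partial sums serially after the parallel loop. */
    for (tid = 0; tid < nthreads; tid++){
        e_t_sum  += e_t_partial[tid];
        e_in_sum += e_in_partial[tid];
    }
    free(e_t_partial);
    free(e_in_partial);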

Note that this introduces a potential false sharing issue in that multiple elements of these arrays will lie within common cachelines, and multiple processors will be writing to this same cacheline. If you run this on a set of cores that share their cache, this will be fine, but will stink elsewhere.
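If you do hit that, one common workaround (a sketch; the 64-byte line size is an assumption about your hardware, and padded_sum is just an illustrative name) is to pad each thread's slot out to its own cache line:

    typedef struct {
        double e_t;
        double e_in;
        char pad[64 - 2 * sizeof(double)];   /* round the struct up to one 64-byte line */
    } padded_sum;

    /* One padded accumulator per thread, so no two threads write the same line. */
    padded_sum *sums = calloc(omp_get_max_threads(), sizeof(padded_sum));
    /* ...inside the loop: sums[tid].e_t += ...; sums[tid].e_in += ...; */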

Another possibility: You might be experiencing false sharing in your updates to the elements of ee_t. Ensure alignment of this array, and try chunk sizes that are multiples of your cache line size. One subtle hint at this pathology would be the part of the loop where i > c taking disproportionately longer than the part where i < c.
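For example (a sketch only; posix_memalign and the 64-byte cache line are assumptions about your platform, and alloc_aligned_rows is just an illustrative helper):

    #define CACHE_LINE 64
    #define DOUBLES_PER_LINE (CACHE_LINE / sizeof(double))

    /* Allocate each row of an n-by-n matrix on a cache-line boundary. */
    double **alloc_aligned_rows(int n)
    {
        double **m = malloc(n * sizeof *m);
        for (int i = 0; i < n; i++)
            if (posix_memalign((void **)&m[i], CACHE_LINE, n * sizeof(double)) != 0)
                return NULL;   /* allocation failed; real code would clean up */
        return m;
    }

    /* ...and pick chunk sizes that cover whole cache lines, e.g.: */
    /* #pragma omp parallel for schedule(static, 8 * DOUBLES_PER_LINE) ... */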

Novelocrat
