In the code below I'm trying to compare every element of an array to all the other elements in a nested for loop. (It's for a simple n-body simulation; I'm testing with only 4 bodies for 4 threads on 4 cores.) An identical sequential version of the code, without the OpenMP modifications, runs in around 15 seconds for 25M iterations. Last night this code ran in around 30 seconds. Now it runs in around 1 minute! I think the problem may be that the threads have to write to the array, which is passed to the function via a pointer. The array is dynamically allocated elsewhere and is composed of structs I defined. This is just a hunch. I have verified that the 4 threads are running on 4 separate cores at 100% and that they are accessing the elements of the array properly. Any ideas?

  void runSimulation (particle* particles, int numSteps){
    //particles is a pointer to an array of structs I've defined and allocated dynamically before calling the function
    //Variable Initializations


    #pragma omp parallel num_threads(4) private(//The variables inside the loop) shared(k,particles) // 4 Threads for four cores
    {
        while (k<numSteps){ //Main loop.  

            #pragma omp master //Check whether it is time to report progress.
            {
                //Some simple if statements
                k=k+1; //Increment step counter; for some reason omp doesn't like k++
            }


            //Calculate new velocities
            #pragma omp for
            for (i=0; i<numParticles; i++){ //Calculate forces by comparing each particle to all others
                Fx = 0;
                Fy = 0;
                for (j=0; j<numParticles; j++){
                    //Calculate the cumulative force by comparing each particle to all others
                }
                //Calculate accelerations and set new velocities
                ax = Fx / particles[i].mass;
                ay = Fy / particles[i].mass;

                                //ARE THESE TWO LINES THE PROBLEM?!
                particles[i].xVelocity += deltaT*ax;
                particles[i].yVelocity += deltaT*ay;
            }           


            #pragma omp master
            //Apply new velocities to create new positions after all forces have been calculated.
            for (i=0; i<numParticles; i++){
                particles[i].x += deltaT*particles[i].xVelocity;
                particles[i].y += deltaT*particles[i].yVelocity;
            }

            #pragma omp barrier
        }
    }
}
A: 

Not sure if this will fix the problem, but you might try giving each thread its own copy of the full array; the problem might be that the threads are fighting over access to the shared array, and you're seeing a lot of cache misses.

I'm not sure of the exact OpenMP syntax you'd use to do this, but try the following (a rough sketch is included after the list):

  • Allocate memory to hold the entire particles array in each thread; do this once, and save all four new pointers.
  • At the beginning of each main loop iteration, in the master thread, deep-copy the main array into each of those four new arrays. You can do this quickly with a memcpy().
  • Do the calculation such that the first thread writes to indices 0 <= i < numParticles/4, and so on.
  • In the master thread, before you apply the new velocities, merge the four arrays into the main array by copying over only the relevant indices. You can do this quickly with a memcpy().
  • Note that you can parallelize your "apply new velocities" loop without any problems because each iteration only operates on a single index; this is probably the easiest part to parallelize.
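Here's a minimal sketch of what those steps could look like in C with OpenMP. The particle struct, the deltaT parameter, and the names used here are assumptions based on the question's code, and the inner force loop is left as a comment; only the allocate/copy/partition/merge pattern is the point.

  /* Hypothetical sketch of the per-thread-copy scheme above.  The particle
     struct, deltaT, and the empty force loop are assumptions taken from the
     question; only the copy/partition/merge pattern is what matters. */
  #include <stdlib.h>
  #include <string.h>
  #include <omp.h>

  #define NUM_THREADS 4

  typedef struct {
      double x, y, xVelocity, yVelocity, mass;
  } particle;

  void runSimulationCopied(particle* particles, int numParticles,
                           int numSteps, double deltaT){
      particle* copies[NUM_THREADS];            //One private copy of the whole array per thread
      for (int t=0; t<NUM_THREADS; t++)
          copies[t] = malloc(numParticles*sizeof(particle)); //Allocate once, reuse every step

      for (int k=0; k<numSteps; k++){
          for (int t=0; t<NUM_THREADS; t++)     //Deep-copy the main array into every private copy
              memcpy(copies[t], particles, numParticles*sizeof(particle));

          #pragma omp parallel num_threads(NUM_THREADS)
          {
              int t     = omp_get_thread_num();
              int chunk = (numParticles + NUM_THREADS - 1) / NUM_THREADS;
              int begin = t*chunk;
              int end   = (begin+chunk > numParticles) ? numParticles : begin+chunk;
              particle* local = copies[t];      //Each thread reads/writes only its own copy

              for (int i=begin; i<end; i++){
                  double Fx = 0, Fy = 0;
                  for (int j=0; j<numParticles; j++){
                      //Accumulate the force of particle j on particle i into Fx, Fy
                  }
                  local[i].xVelocity += deltaT * (Fx / local[i].mass);
                  local[i].yVelocity += deltaT * (Fy / local[i].mass);
              }
          }

          for (int t=0; t<NUM_THREADS; t++){    //Merge: copy each thread's slice back into the main array
              int chunk = (numParticles + NUM_THREADS - 1) / NUM_THREADS;
              int begin = t*chunk;
              int end   = (begin+chunk > numParticles) ? numParticles : begin+chunk;
              if (begin < end)
                  memcpy(&particles[begin], &copies[t][begin], (end-begin)*sizeof(particle));
          }

          #pragma omp parallel for num_threads(NUM_THREADS) //Position update touches one index per
          for (int i=0; i<numParticles; i++){               //iteration, so it can run on the shared array
              particles[i].x += deltaT*particles[i].xVelocity;
              particles[i].y += deltaT*particles[i].yVelocity;
          }
      }

      for (int t=0; t<NUM_THREADS; t++)
          free(copies[t]);
  }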

The new operations will only be O(N) compared to your calculations which are O(N^2), so they shouldn't take too much time in the long run. There are definitely ways to optimize the steps that I laid out for you, Gabe, but I'll leave those to you.

btown