I'm using the Monte Carlo method to calculate pi as a basic experiment with parallel programming and OpenMP.

The problem is that 1 thread with x iterations always runs faster than n threads with the same x iterations. Can anyone tell me why?

For example, the code runs like this: "a.out 1 1000000", where 1 is the number of threads and 1000000 is the number of iterations.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <iomanip>
#include <math.h>

using namespace std;

int main (int argc, char *argv[]) {

    double arrow_area_circle, pi;
    float xp, yp;
    int i, iterations;
    double pitg = atan(1.0)*4.0; // reference value of pi, for the error
    cout << "Number of processors: " << omp_get_num_procs() << endl;

    // Number of iterations
    iterations = atoi(argv[2]);
    arrow_area_circle = 0.0;

    #pragma omp parallel num_threads(atoi(argv[1]))
    {
        srandom(omp_get_thread_num());

        #pragma omp for private(xp, yp) reduction(+:arrow_area_circle) // combine per-thread sums with +
        for (i = 0; i < iterations; i++) {
            xp = rand()/(float)RAND_MAX;
            yp = rand()/(float)RAND_MAX;

            if (pow(xp,2.0) + pow(yp,2.0) <= 1) arrow_area_circle++;
        }
    }

    pi = 4*arrow_area_circle / iterations;
    cout << setprecision(18) << "PI = " << pi << endl << endl;
    cout << setprecision(18) << "Error = " << pitg - pi << endl << endl;

    return 0;
}
+1  A: 

Context switching.

Marius Burz
+6  A: 

A CPU-intensive task like this will be slower if you do the work in more threads than there are CPUs in the system. If you are running it on a single-CPU system, you will definitely see a slowdown with more than one thread. This is due to the OS having to switch between the various threads, which is pure overhead. Ideally you should have the same number of threads as cores for a task like this.

Another issue is that arrow_area_circle is shared between threads. If you have a thread running on each core, incrementing arrow_area_circle invalidates the copy in the caches of the other cores, forcing them to refetch it. An arrow_area_circle++ that should take a couple of cycles can take dozens or hundreds instead. Try creating one arrow_area_circle per thread and combining them at the end.

EDIT: Joe Duffy just posted a blog entry on the cost of sharing data between threads.

Michael
I have a Core Duo, but I'm going to try a solution for arrow_area_circle
blueomega
You should see a speed-up at 2 threads, then, and a slowdown after that.
Michael
+4  A: 

It looks like you are using some kind of auto-parallelizing compiler. I am going to assume you have more than one core/CPU in your system (as that would be too obvious; and no, hyperthreading on a Pentium 4 doesn't count as having two cores, regardless of what Intel's marketing would have you believe). There are two problems that I see. The first is trivial and probably not your problem:

  1. If the variable arrow_area_circle is shared between your processes, then the act of executing arrow_area_circle++ will cause an interlocked instruction to be used to update the value in an atomically sound way. You should increment a "private" variable instead, then add that value just once at the end to the shared arrow_area_circle variable, rather than incrementing arrow_area_circle in your inner loop.

  2. The rand() function, to operate soundly, must execute with an internal critical section. The reason is that its internal state/seed is a static shared variable; if it were not, two different processes could get the same output from rand() with unusually high probability, just because they were calling rand() at nearly the same time. That means rand() runs slowly, and especially so as more threads/processes call it at the same time. Unlike the arrow_area_circle variable (which just needs an atomic increment), a true critical section has to be invoked by rand() because its state update is more complicated. To work around this, obtain the source code for your own random number generator and use it with a private seed or state. The source code for the standard rand() implementation in most compilers is widely available.

I'd also like to point out that you are using the pow() function to compute the same thing as x * x. The latter is about 300 times faster than the former, though this point is irrelevant to the question you are asking. :)

Paul Hsieh
A: 

Just to stress that you have to be really careful using random numbers in a parallel setting. In fact, you should use something like SPRNG.

Whatever you do, make sure each thread isn't using the same random number stream.

csgillespie
A: 

rand() is a blocking function, which means it has a critical section inside.

Vova