views:

67

answers:

2

Hello,

I am runig a program on a server at my university that has 4 Dual-Core AMD Opteron(tm) Processor 2210 HE and the O.S. is Linux version 2.6.27.25-78.2.56.fc9.x86_64. My program implements Conways Game of Life and it runs using pthreads and openmp. I timed the parrallel part of the program using the getimeofday() function using 1-8 threads. But the timings don't seem right. I get the biggest time using 1 thread(as expected), then the time gets smaller. But the smallest time i get is when i use 4 threads.

Here is an example when i use an array 1000x1000.

Using 1 thread~9,62 sec, Using 2 Threads~4,73 sec, Using 3 ~ 3.64 sec, Using 4~2.99 sec, Using 5 ~4,19 sec, Using 6~3.84, Using 7~3.34, Using 8~3.12.

The above timings are when i use pthreads. When i use openmp the timing are smaller but follow the same pattern.

I expected that the time would decrease from 1-8 because of the 4 Dual core cpus? I thought that because there are 4 cpus with 2 cores each, 8 threads could run at the same time. Does it have to do with the operating system that the server runs?

Also i tested the same programs on another server that has 7 Dual-Core AMD Opteron(tm) Processor 8214 and runs Linux version 2.6.18-194.3.1.el5. There the timings i get are what i expected. The timings get smaller starting from 1(the biggest) to 8(smallest excecution time).

The program implements the Game of Life correct, both using pthreads and openmp, i just cant figure out why the timings are like the example i posted. So in conclusion, my questions are:

1) The number of threads that can run at the same time on a system depends by the cores of the cpus?it depends only by the cpus although each cpu has more than one cores? It depends by all the previous and the Operating System?

2) Does it have to do with the way i divide the 1000x1000 array to the number of threads? But if i did then the openmp code wouldn't give the same pattern of timings?

3)What is the reason i might get such timmings?

excuse my english i am from europe... thnx in advanse.

EDIT: This is the code i use with openmp:

#define  Row 1000+2
#define Col 1000+2 int num; int (*temp)[Col]; int (*a1)[Col]; int (*a2)[Col];

int main() {
        int i,j,l,sum;
        int array1[Row][Col],array2[Row][Col];
        struct timeval tim;
        struct tm *tm;
        double start,end;
        int st,en;

           for (i=0; i<Row; i++)
                for (j=0; j<Col; j++)
                   {
                    array1[i][j]=0;
                    array2[i][j]=0;
                  }
        array1[3][16]=1;
        array1[4][16]=1;
        array1[5][15]=1;
        array1[6][15]=1;
        array1[6][16]=1;
        array1[7][16]=1;
        array1[5][14]=1;
        array1[4][15]=1;
        a1=array1;
        a2=array2;
        printf ("\nGive number of threads:");
        scanf("%d",&num);

        gettimeofday(&tim,NULL);
        start=tim.tv_sec+(tim.tv_usec/1000000.0); omp_set_num_threads(num);
        #pragma omp parallel private(l,i,j,sum)
        {

                printf("Number of Threads:%d\n",omp_get_num_threads());
                for (l=0; l<100; l++)
                        {
                        #pragma omp for
                        for (i=1; i<(Row-1); i++)
                        {
                                for (j=1; j<(Col-1); j++)
                                {  
                                        sum=a1[i-1][j-1]+a1[i-1][j]+a1[i-1][j+1]+a1[i][j-1]+a1[i][j+1]+a1[i+1][j-1]+a1[i+1][j]+a1[i+1][j+1];
                                        if ((a1[i][j]==1) && (sum==2||sum==3))
                                                a2[i][j]=1;
                                        else if ((a1[i][j]==1) && (sum<2))
                                                a2[i][j]=0;
                                        else if ((a1[i][j]==1) && (sum>3))
                                                a2[i][j]=0;
                                        else if ((a1[i][j]==0 )&& (sum==3))  
                                                a2[i][j]=1;
                                        else if (a1[i][j]==0)
                                                a2[i][j]=0;

                                }//end of iteration J
                        }//end of iteration I
                        #pragma omp barrier
                        #pragma omp single
                                {

                                        temp=a1;
                                        a1=a2;
                                        a2=temp;
                                }

                        #pragma omp barrier
                        }//end of iteration L

        }//end of paraller region
        gettimeofday(&tim,NULL);
         end=tim.tv_sec+(tim.tv_usec/1000000.0);
         printf("\nTime Elapsed:%.6lf\n",end-start);
        printf("all ok\n");
        return 0; }

TIMINGS with openmp code

a)System with 7 Dual Core Cpus Using 1 thread~7,72 sec, Using 2 threads~4,53 sec, Using 3 Threads~3,64 sec, Using 4 threads~ 2,24 sec, Using 5~2,02 sec, Using 6~ 1,78 sec, Using 7 ~1,59 sec,Using 8 ~ 1,44 sec

b)System with 4 Dual Core Cpus Using 1 thread~9,06 sec, Using 2 threads~4,86 sec, Using 3 Threads~3,49 sec, Using 4 threads~ 2,61 sec, Using 5~3,98 sec, Using 6~ 3,53 sec, Using 7 ~3,48 sec,Using 8 ~ 3,32 sec

Above are the timings i get.

A: 

One thing you have to remember is that you're doing this on a shared memory architecture. The more loads/stores you are trying to do in parallel, the more chance you're going to have to hit contention with regards to memory access, which is a relatively slow operation. So in typical applications in my experience, don't benefit from more than 6 cores. (This is anecdotal, I could go into a lot of detail, but I don't feel like typing. Suffice to say, take these numbers with a grain of salt).

Try instead to minimize access to shared resources if possible, see what that does to your performance. Otherwise, optimize for what you got, and remember this:

Throwing more cores at a problem does not mean it will go quicker. Like with taxation, there's a curve as to when the number of cores, starts becoming a detriment to collecting the most performance out of your program. Find that "sweet spot", and use it.

jer
Hello and thanx for the answer. Yes i understand that there are data dependencies but(solving them is beyond the goal of my program) the code being as it is, when i test it on the system with 4 Dual Core cpus(so 8 cores) and the system that has 7 Dual Core Cpus the data dependencies and cache thrashing and update policy are the same, the cache size of each hasnt changed. But with the system that has 7 Dual Core Cpus i get the timings i expected but with the other system that has 8 cores(and the max number of threads i use is still 8) the smallest execution is when i use 4 threads.
stois21
A: 

You write

The above timings are when i use pthreads. When i use openmp the timing are smaller but follow the same pattern.

Congratulations, you have discovered the pattern which all parallel programs follow ! If you plot execution time against number of processors the curve eventually flattens out and starts to rise; you reach a point where adding more processors slows things down.

The interesting question is how many processors you can profitably use and the answer to this is dependent on many factors. @jer has pointed out some of the factors which affect the scalability of programs on shared-memory computers. Other factors, principally the ratio of communication to computation, ensure that the shape of the performance curve will be the same on distributed-memory computers too.

The other factor which is important when measuring the parallel scalability of your program is the problem size(s) you use. How does your performance curve change when you try a grid of 1414 x 1414 cells ? I would expect that the curve will be below the curve for the problem on 1000 x 1000 cells and will flatten out later.

For further reading Google for Amdahl's Law and Gustafson's Law.

High Performance Mark
Hello and thanx for the answer, so even though the one system has 8 cores and the max number of threads i use is 8, its logical that the smallest execution time i get is only with 4 threads?And when i test my code on the system that has 14 cores as you can see in my edited post the timing are what would someone expect from a 14 core system when the max number of threads you use are 8.
stois21
Yes, none of what you report is a surprise. I'd be disappointed if I created a program which minimised execution time on only 4 out of 8 cores, but not surprised. Did you try running larger problems ?
High Performance Mark
The biggest size of array i can try is 1000x1000 when i tried anything a lot bigger it didnt compile. So ok for the system that has 8 cores due to scheduling, memory transfers and the update policy of the system architecture the smallest time is for 4 threads. But then just by incresing the cpu cores i get the timings i expected? Isnt it a fact that just by increasing the hardware you dont get an improvment?
stois21