Hey all,

So I've been playing around with pthreads, specifically trying to calculate the product of two matrices. My code is extremely messy because it was just supposed to be a quick little fun project for myself, but the threading approach I used was very similar to this:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define M 3
#define K 2
#define N 3
#define NUM_THREADS 10

int A [M][K] = { {1,4}, {2,5}, {3,6} };
int B [K][N] = { {8,7,6}, {5,4,3} };
int C [M][N];

struct v {
   int i; /* row */
   int j; /* column */
};

void *runner(void *param); /* the thread */

int main(int argc, char *argv[]) {

   int i,j, count = 0;
   for(i = 0; i < M; i++) {
      for(j = 0; j < N; j++) {
         //Assign a row and column for each thread
         struct v *data = (struct v *) malloc(sizeof(struct v));
         data->i = i;
         data->j = j;
         /* Now create the thread passing it data as a parameter */
         pthread_t tid;       //Thread ID
         pthread_attr_t attr; //Set of thread attributes
         //Get the default attributes
         pthread_attr_init(&attr);
         //Create the thread
         pthread_create(&tid,&attr,runner,data);
         //Make sure the parent waits for all threads to complete
         pthread_join(tid, NULL);
         count++;
      }
   }

   //Print out the resulting matrix
   for(i = 0; i < M; i++) {
      for(j = 0; j < N; j++) {
         printf("%d ", C[i][j]);
      }
      printf("\n");
   }

   return 0;
}

//The thread will begin control in this function
void *runner(void *param) {
   struct v *data = param; // the structure that holds our data
   int n, sum = 0; //the counter and sum

   //Row multiplied by column
   for(n = 0; n < K; n++){
      sum += A[data->i][n] * B[n][data->j];
   }
   //assign the sum to its coordinate
   C[data->i][data->j] = sum;

   //Free the heap-allocated parameter and exit the thread
   free(data);
   pthread_exit(0);
}

source: http://macboypro.com/blog/2009/06/29/matrix-multiplication-in-c-using-pthreads-on-linux/

For the non-threaded version, I used the same setup (three 2-D matrices, dynamically allocated structs to hold the row/column), and added a timer. The first trials indicated that the non-threaded version was faster. My first thought was that the dimensions were too small to show a difference and that thread creation was eating up the time, so I increased the dimensions to about 50x50, filled the matrices randomly, and ran it again. I'm still not seeing any performance improvement in the threaded version.

What am I missing here?

+10  A: 

Unless you're working with very large matrices (many thousands of rows/columns), you are unlikely to see much improvement from this approach. Setting up a thread on a modern CPU/OS is actually pretty expensive in relative terms: it costs far more CPU time than a few multiply operations.

Also, it's usually not worthwhile to set up more than one thread per CPU core that you have available. If you have, say, only two cores and you set up 2500 threads (for 50x50 matrices), then the OS is going to spend all its time managing and switching between those 2500 threads rather than doing your calculations.

If you were to set up two threads beforehand (still assuming a two-core CPU), keep those threads alive and waiting for work, and feed them the 2500 dot products you need to calculate through some kind of synchronised work queue, then you might start to see an improvement. Even then, though, two cores can at best roughly halve the runtime compared to using only one core.
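For illustration, here's a minimal sketch of that kind of setup: a fixed pool of worker threads pulling dot-product tasks from a shared counter, which stands in for a real work queue (NUM_WORKERS, worker and next_task are made-up names for this example, and the matrix sizes are arbitrary):

#include <pthread.h>
#include <stdio.h>

#define M 50
#define K 50
#define N 50
#define NUM_WORKERS 2   /* roughly one per available core */

int A[M][K], B[K][N], C[M][N];

static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task = 0;   /* next cell of C to compute, in 0..M*N-1 */

static void *worker(void *arg) {
   (void)arg;
   for (;;) {
      /* Grab the index of the next cell under the lock */
      pthread_mutex_lock(&task_lock);
      int t = next_task++;
      pthread_mutex_unlock(&task_lock);
      if (t >= M * N)
         break;   /* no work left */

      int i = t / N, j = t % N, n, sum = 0;
      for (n = 0; n < K; n++)
         sum += A[i][n] * B[n][j];
      C[i][j] = sum;
   }
   return NULL;
}

int main(void) {
   pthread_t workers[NUM_WORKERS];
   int w;

   /* ... fill A and B here ... */

   for (w = 0; w < NUM_WORKERS; w++)
      pthread_create(&workers[w], NULL, worker, NULL);
   for (w = 0; w < NUM_WORKERS; w++)
      pthread_join(workers[w], NULL);

   printf("C[0][0] = %d\n", C[0][0]);
   return 0;
}

In practice you'd hand out a bigger chunk per lock acquisition (e.g. a whole row of C), since a single dot product is far too little work to justify even the locking overhead.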

Greg Hewgill
The one caveat to that being the situation where you have a UI thread and a worker thread.
Chris Thompson
@Chris Thompson: Your UI thread is unlikely to be using much CPU power. The advantage to having a separate UI thread is to not *block* your UI thread while doing computation, which keeps your UI responsive.
Greg Hewgill
@Greg, right. That's what I meant :-)
Chris Thompson
+1  A: 

You don't allow much parallel execution: you wait for each thread immediately after creating it, so at any moment only the main thread and one worker exist (and the main thread is just blocked waiting), which means your program can never make use of additional CPUs/cores. Try letting more threads run at once, probably about as many as you have cores.
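For example, a rough sketch of that change to the loop in your code (still one thread per cell, which is far too many threads for a 50x50 matrix, but it shows the create-everything-first, join-afterwards structure):

   pthread_t tids[M * N];
   int i, j, count = 0;

   /* Create all the threads first, so they can actually run concurrently... */
   for(i = 0; i < M; i++) {
      for(j = 0; j < N; j++) {
         struct v *data = malloc(sizeof(struct v));
         data->i = i;
         data->j = j;
         pthread_create(&tids[count], NULL, runner, data);
         count++;
      }
   }

   /* ...and only then wait for all of them to finish */
   for(count = 0; count < M * N; count++)
      pthread_join(tids[count], NULL);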

brittle
+3  A: 

I'm not entirely sure I understand the source code, but here's what it looks like: You have a loop that runs M*N times. Each time through the loop, you create a thread that fills in one number in the result matrix. But right after you launch the thread, you wait for it to complete. I don't think that you're ever actually running more than one thread.

Even if you were running more than one thread, each thread is doing a trivial amount of work. Even if K were large (you mention 50), 50 multiplications isn't much compared to the cost of starting the thread in the first place. The program should create fewer threads--certainly no more than the number of processors--and assign more work to each.
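A sketch of that kind of partitioning (the names and the thread count here are made up; each thread computes a contiguous block of rows of C):

#include <pthread.h>

#define M 50
#define K 50
#define N 50
#define NUM_THREADS 4   /* ideally the number of cores */

int A[M][K], B[K][N], C[M][N];

struct range { int row_start; int row_end; };   /* compute rows [row_start, row_end) */

static void *multiply_rows(void *param) {
   struct range *r = param;
   int i, j, n;
   for (i = r->row_start; i < r->row_end; i++)
      for (j = 0; j < N; j++) {
         int sum = 0;
         for (n = 0; n < K; n++)
            sum += A[i][n] * B[n][j];
         C[i][j] = sum;
      }
   return NULL;
}

int main(void) {
   pthread_t tids[NUM_THREADS];
   struct range ranges[NUM_THREADS];
   int t, rows_per_thread = (M + NUM_THREADS - 1) / NUM_THREADS;   /* ceiling division */

   /* ... fill A and B here ... */

   for (t = 0; t < NUM_THREADS; t++) {
      ranges[t].row_start = t * rows_per_thread;
      ranges[t].row_end = (t + 1) * rows_per_thread;
      if (ranges[t].row_end > M)
         ranges[t].row_end = M;
      pthread_create(&tids[t], NULL, multiply_rows, &ranges[t]);
   }
   for (t = 0; t < NUM_THREADS; t++)
      pthread_join(tids[t], NULL);
   return 0;
}

Each thread writes only its own rows of C, so no locking is needed and there is very little interference between threads.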

Willis Blackburn
+1  A: 

If you have a processor with two cores, then you should just divide the work in two halves and give each thread one of them. The same principle applies if you have 3, 4, or 5 cores. The optimal design will always match the number of threads to the number of available cores (by available I mean cores that aren't already being heavily used by other processes).

One other thing you have to consider is that each thread's data should be contiguous and independent from the data used by the other threads. Otherwise cache misses will significantly slow down the processing.
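As a small illustration of that point (a made-up example, not from the book): if several threads each update their own counter but the counters sit next to each other in memory, they share a cache line and the cores keep invalidating it for one another ("false sharing"). Padding each thread's data out to a full cache line avoids that:

#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE  64        /* typical cache-line size; check your CPU's */
#define NUM_THREADS 4
#define ITERATIONS  10000000L

/* Give each thread's accumulator its own cache line, so one thread's writes
   don't keep invalidating the line another thread is using */
struct padded_counter {
   long value;
   char pad[CACHE_LINE - sizeof(long)];
};

static struct padded_counter counters[NUM_THREADS];

static void *bump(void *arg) {
   struct padded_counter *c = arg;
   long i;
   for (i = 0; i < ITERATIONS; i++)
      c->value++;
   return NULL;
}

int main(void) {
   pthread_t tids[NUM_THREADS];
   int t;
   for (t = 0; t < NUM_THREADS; t++)
      pthread_create(&tids[t], NULL, bump, &counters[t]);
   for (t = 0; t < NUM_THREADS; t++)
      pthread_join(tids[t], NULL);
   for (t = 0; t < NUM_THREADS; t++)
      printf("counter %d = %ld\n", t, counters[t].value);
   return 0;
}

In the matrix case the analogous rule is to have each thread write its own block of rows of C rather than interleaving writes to neighbouring cells.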

To better understand these issues, I'd recommend the book Patterns for Parallel Programming http://astore.amazon.com/amazon-books-20/detail/0321228111

Although its code samples are geared more toward OpenMP and MPI while you're using pthreads, the first half of the book is rich in fundamental concepts and the inner workings of multithreaded environments, and it's very useful for avoiding most of the performance bottlenecks you'll encounter.

Fabio Ceconello
A: 

Provided the code parallelizes correctly (I haven't checked it), you're likely to see a performance boost only when the code is parallelized in hardware, i.e. when the threads really run in parallel (multiple cores, multiple CPUs, or other technologies), and not just apparently parallel through time-slicing ("multitasking"). Just an idea, I am not sure this is the case.

ShinTakezou