I have written this little program to calculate pi using probability and ratios. To make it run faster I decided to give multithreading with pthreads a shot. Unfortunately, even after much searching around, I have been unable to solve this problem: when I run the threadFunc function with a single thread, whether as a pthread or just called directly from calculate_pi_mt, performance is much better (at least two or three times better) than when I run it with two threads on my dual-core machine. I have tried disabling optimizations, to no avail. As far as I can see, the thread uses only local variables while it runs, apart from at the end where I use a mutex lock to add its hits to the total...

Firstly, are there any tips for writing code that will run better here (i.e. style)? I'm just learning by trying this stuff.

And secondly, is there any reason for these obvious performance problems? When running with the number of threads set to 1, one of my CPUs maxes out at 100%. When set to two, the second CPU rises to roughly 80-90%, but all the extra work it is apparently doing is to no avail! Could it be the use of the rand() function?

struct arguments {
    int n_threads;
    int rays;
    int hits_in;
    pthread_mutex_t *mutex;
};


void *threadFunc(void *arg)
{
    struct arguments* args=(struct arguments*)arg;

    int n = 0;
    int local_hits_in = 0;
    double x;
    double y;
    double r;
    while (n < args->rays)
    {
        n++;
        x = ((double)rand())/((double)RAND_MAX);
        y = ((double)rand())/((double)RAND_MAX);
        r = (double)sqrt(pow(x, 2) + pow(y, 2)); 
        if (r < 1.0){
            local_hits_in++;
        }
    }

    pthread_mutex_lock(args->mutex);
    args->hits_in += local_hits_in;
    pthread_mutex_unlock(args->mutex);

    return NULL;
}


double calculate_pi_mt(int rays, int threads){
    double answer;
    int c;
    unsigned int iseed = (unsigned int)time(NULL);
    srand(iseed);

    if ( (float)(rays/threads) != ((float)rays)/((float)threads) ){
        printf("Error: number of rays is not evenly divisible by threads\n");
    }

    /* argument initialization */
    struct arguments* args = malloc(sizeof(struct arguments));
    args->hits_in = 0;
    args->rays = rays/threads;
    args->n_threads = 0;
    args->mutex = malloc(sizeof(pthread_mutex_t));
    if (pthread_mutex_init(args->mutex, NULL)){
        printf("Error creating mutex!\n");
    }


    pthread_t thread_ary[MAXTHREADS];

    c=0;
    while (c < threads){
        args->n_threads += 1;
        if (pthread_create(&(thread_ary[c]),NULL,threadFunc, args)){
            printf("Error when creating thread\n");
        }
        printf("Created Thread: %d\n", args->n_threads);
        c+=1;
    }


    c=0;
    while (c < threads){
        printf("main waiting for thread %d to terminate...\n", c+1);
        if (pthread_join(thread_ary[c],NULL)){
            printf("Error while waiting for thread to join\n");
        }
        printf("Destroyed Thread: %d\n", c+1);

        c+=1;
    }

    printf("Hits in %d\n", args->hits_in);
    printf("Rays: %d\n", rays);
    answer = 4.0 * (double)(args->hits_in)/(double)(rays);

    //freeing everything!
    pthread_mutex_destroy(args->mutex);
    free(args->mutex);
    free(args);

    return answer;
}
A: 

Threading has a cost. It may be that, since your useful computation looks very simple, the cost of thread management (the cost paid on thread switches and on synchronisation) is much higher than the benefit.

kriss
Yes, but although it is simple, it does a fair amount of work depending on the number of rays I set it to, so theoretically threading should pay off, should it not? I'm not really sure myself, still figuring out whether that holds true! Thanks for the comment anyway.
kellpossible
@kellpossible: you tell us. At best it will depend on the value of rays. At worst the different threads will be scheduled on the same core... For code as simple as this I would expect any benefit (if any) to come only at quite a high value of rays (maybe thousands). But you do not provide that information. With which value of rays did you try your code?
kriss
10**8 rays :). With the help from the guys above I'm starting to see benefits at the larger numbers.
kellpossible
@kellpossible: cool. Some feedback telling us at which number of rays you see a benefit would be nice (maybe summarize the new results after the improvements at the end of the question).
kriss
Good idea, I'll do it first thing tomorrow.
kellpossible
+6  A: 

There are a couple of problems I can see:

  • rand() is not thread-safe. Use drand48_r() instead (it natively generates a double in the range [0.0, 1.0), which is what you want).
  • You only create one struct arguments structure and then try to use it for multiple threads. You need to create a separate one for each thread (just use an array).

Here's how I'd clean up your approach. Note how we don't need any mutexes - each thread just stashes its own result in a separate location, and the main thread adds them up after the other threads have finished:

#define _GNU_SOURCE   /* exposes drand48_r()/srand48_r() even under strict -std modes */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include <pthread.h>

struct thread_info {
    int thread_n;
    pthread_t thread_id;
    int rays;
    int hits_in;
};

void seed_rand(int thread_n, struct drand48_data *buffer)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    srand48_r(tv.tv_sec * thread_n + tv.tv_usec, buffer);
}

void *threadFunc(void *arg)
{
    struct thread_info *thread_info = arg;
    struct drand48_data drand_buffer;

    int n = 0;
    const int rays = thread_info->rays;
    int hits_in = 0;
    double x;
    double y;
    double r;

    seed_rand(thread_info->thread_n, &drand_buffer);

    for (n = 0; n < rays; n++)
    {
        drand48_r(&drand_buffer, &x);
        drand48_r(&drand_buffer, &y);
        r = x * x + y * y;
        if (r < 1.0){
            hits_in++;
        }
    }

    thread_info->hits_in = hits_in;
    return NULL;
}


double calculate_pi_mt(int rays, int threads)
{
    int c;
    int hits_in = 0;

    if (rays % threads) {
        printf("Error: number of rays is not evenly divisible by threads\n");
        rays = (rays / threads) * threads;
    }

    /* argument initialization */
    struct thread_info *thr = malloc(threads * sizeof thr[0]);

    for (c = 0; c < threads; c++) {
        thr[c].thread_n = c;
        thr[c].rays = rays / threads;
        thr[c].hits_in = 0;
        if (pthread_create(&thr[c].thread_id, NULL, threadFunc, &thr[c])) {
            printf("Error when creating thread\n");
        }
        printf("Created Thread: %d\n", thr[c].thread_n);
    }

    for (c = 0; c < threads; c++) {
        printf("main waiting for thread %d to terminate...\n", c);
        if (pthread_join(thr[c].thread_id, NULL)) {
            printf("Error while waiting for thread to join\n");
        }
        hits_in += thr[c].hits_in;
        printf("Destroyed Thread: %d\n", c+1);
    }

    printf("Hits in %d\n", hits_in);
    printf("Rays: %d\n", rays);
    double answer = (4.0 * hits_in) / rays;

    free(thr);

    return answer;
}
caf
Yes! That rand() is probably the problem! And thanks for the tip about the range of the drand48_r() function - it took me a little while to figure out how to get rand() to produce results in that range! About the struct arguments... if it is only used once at the start of the thread and once at the end, will it still affect performance?
kellpossible
@kellpossible: The argument thing is a correctness issue. Your speed issue is the amount of locking and unlocking, which is completely unnecessary.
caf
Thanks caf, I'll test this solution... Hopefully it works and I can press the tick button... and hopefully I've learned something from your advice!
kellpossible
Thanks a lot caf! Now it actually runs marginally faster on 2 threads with 10**8 rays instead of being 3 times slower. With the other optimizations suggested, it runs like a breeze!
kellpossible
Well, I have quantified the advantage now: it takes 48 seconds to run instead of 50 when set to 2 threads with a certain number of rays. Is this performance gain smaller than should be expected?
kellpossible
Wow! By temporarily using local variables instead of pointers to the thread-info struct the calculation time for 2 threads halves, but single thread performance remains more or less the same. How strange!
kellpossible
@kellpossible: The `struct thread_info` is so small (16 bytes on x86) that multiple copies of it fit into one cacheline (usually 64 bytes). This means that that cacheline is "ping-ponging" between each CPU core's cache. You could pad out the `struct thread_info` to a full cacheline - or just use local variables during the calculation as you've done. I've updated the answer accordingly.
caf
I suspected something along these lines, for what else could explain the slowdown occurring only with multiple threads :). Please excuse my ignorance, but aren't there two copies of the thread_info struct, one for each thread (which could be stored separately, one in each cache)? Or is there some piece of data that both copies depend on (like types or something weird like that)? Also, about padding the struct... can I do this by adding more variables to increase its size to 64 bytes, or can the compiler do it for me? Thanks for your experienced input anyway caf!
kellpossible
Hmm, padding a struct almost needs a whole separate question!
kellpossible
@kellpossible: Caches are arranged in "cachelines", which are blocks of memory that are moved in and out of the cache together - typically a cacheline is 64 bytes. There *are* two `struct thread_info` - but since they are only 16 bytes long and are next to each other in memory, it is likely that they are both within a single cacheline. This means that you *can't* have one in each cache - the entire cacheline containing both gets bounced between the caches each time one of them is written to.
caf
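For illustration (this sketch is not part of caf's answer): one way to pad the per-thread struct so that each instance occupies its own cacheline is GCC's `aligned` attribute, assuming a 64-byte cacheline:

/* Illustrative sketch, assuming a 64-byte cacheline and GCC/Clang.
   The aligned attribute rounds sizeof(struct thread_info) up to a
   multiple of 64 and aligns every array element on a 64-byte boundary,
   so the two threads' structs no longer share a cacheline. */
struct thread_info {
    int thread_n;
    pthread_t thread_id;
    int rays;
    int hits_in;
} __attribute__((aligned(64)));

Since `malloc` only guarantees a modest alignment, allocating the array with `posix_memalign(&p, 64, threads * sizeof(struct thread_info))` would additionally place every element exactly on a cacheline boundary.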
+7  A: 

You're using far too many synchronization primitives. You should sum the local_hits at the end in the main thread, rather than using a mutex to update a shared counter asynchronously. Or, at the very least, you could use an atomic operation (it's just an int) instead of locking an entire mutex to update one int.
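For illustration (a sketch, not part of this answer): with GCC's atomic builtins the per-thread total can be folded into the shared counter without a mutex:

/* Sketch only: replaces the mutex-protected update at the end of threadFunc.
 * __sync_fetch_and_add() is a GCC/Clang builtin that performs an atomic
 * read-modify-write on the int; C11's atomic_fetch_add() would do the same. */
__sync_fetch_and_add(&args->hits_in, local_hits_in);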

DeadMG
Thanks for the quick reply! My first time here; hopefully I'll get a chance to answer others. I have tried taking the lock off (it was there as a precaution, for I know little about pthreads at the moment!), but it seems to make no difference. I'm not sure what you mean by summing local_hits at the end in the main thread... as far as I can see, each thread transfers its local_hits to the main total (that the answer is derived from) only once, and only when it has finished. I'm not sure what an atomic operation is, but do you mean using a "compare and swap" as Wikipedia suggests?
kellpossible
Make an array of local_hits on the caller's stack. Pass a pointer into it to each thread function. Have each thread use that int as its local_hits. When all threads have finished, main sums the array. Edit: You also need to tell us what value of threads you are using.
DeadMG
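A minimal, self-contained sketch of that suggestion (illustrative names, not DeadMG's code; rand_r() is used only to keep the example thread-safe, and each thread accumulates locally and writes its slot once at the end to sidestep the cacheline issue discussed above):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <pthread.h>

#define MAXTHREADS 16
#define RAYS_PER_THREAD (10 * 1000 * 1000)   /* illustrative constant */

static void *count_hits(void *arg)
{
    int *slot = arg;                  /* this thread's private result slot */
    unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)(uintptr_t)slot;
    int local_hits = 0;

    for (int n = 0; n < RAYS_PER_THREAD; n++) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y < 1.0)
            local_hits++;
    }
    *slot = local_hits;               /* single write at the very end */
    return NULL;
}

int main(void)
{
    int threads = 2;
    pthread_t ids[MAXTHREADS];
    int hits[MAXTHREADS] = {0};       /* one slot per thread, on the caller's stack */
    long total = 0;

    for (int c = 0; c < threads; c++)
        pthread_create(&ids[c], NULL, count_hits, &hits[c]);
    for (int c = 0; c < threads; c++) {
        pthread_join(ids[c], NULL);
        total += hits[c];             /* main thread sums the array */
    }
    printf("pi ~= %f\n", 4.0 * (double)total / ((double)threads * RAYS_PER_THREAD));
    return 0;
}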