I have written this little program to calculate pi using probability and ratios. To make it run faster I decided to give multithreading with pthreads a shot. Unfortunately, even after much searching around, I have been unable to solve this problem: when I run the threadFunc function with a single thread, whether as a pthread or just called directly from calculate_pi_mt, performance is much better (at least two or three times better) than when I run it with two threads on my dual-core machine. I have tried disabling optimizations, to no avail. As far as I can see, the thread uses only local variables while it runs, apart from at the end where I use a mutex lock to add its hits to the total...

Firstly, are there any tips for writing code that will run better here (i.e. style)? I'm just learning by trying this stuff.

And secondly, is there any reason for these obvious performance problems? When running with the number of threads set to 1, one of my CPUs maxes out at 100%. When set to two, the second CPU rises to roughly 80-90%, but all the extra work it is apparently doing is to no avail! Could it be the use of the rand() function?

struct arguments {
    int n_threads;
    int rays;
    int hits_in;
    pthread_mutex_t *mutex;
};


void *threadFunc(void *arg)
{
    struct arguments* args=(struct arguments*)arg;

    int n = 0;
    int local_hits_in = 0;
    double x;
    double y;
    double r;
    while (n < args->rays)
    {
        n++;
        x = ((double)rand())/((double)RAND_MAX);
        y = ((double)rand())/((double)RAND_MAX);
        r = (double)sqrt(pow(x, 2) + pow(y, 2)); 
        if (r < 1.0){
            local_hits_in++;
        }
    }

    pthread_mutex_lock(args->mutex);
    args->hits_in += local_hits_in;
    pthread_mutex_unlock(args->mutex);

    return NULL;
}


double calculate_pi_mt(int rays, int threads){
    double answer;
    int c;
    unsigned int iseed = (unsigned int)time(NULL);
    srand(iseed);

    if ( (float)(rays/threads) != ((float)rays)/((float)threads) ){
        printf("Error: number of rays is not evenly divisible by threads\n");
    }

    /* argument initialization */
    struct arguments* args = malloc(sizeof(struct arguments));
    args->hits_in = 0;
    args->rays = rays/threads;
    args->n_threads = 0;
    args->mutex = malloc(sizeof(pthread_mutex_t));
    if (pthread_mutex_init(args->mutex, NULL)){
        printf("Error creating mutex!\n");
    }


    pthread_t thread_ary[MAXTHREADS];

    c=0;
    while (c < threads){
        args->n_threads += 1;
        if (pthread_create(&(thread_ary[c]),NULL,threadFunc, args)){
            printf("Error when creating thread\n");
        }
        printf("Created Thread: %d\n", args->n_threads);
        c+=1;
    }


    c=0;
    while (c < threads){
        printf("main waiting for thread %d to terminate...\n", c+1);
        if (pthread_join(thread_ary[c],NULL)){
            printf("Error while waiting for thread to join\n");
        }
        printf("Destroyed Thread: %d\n", c+1);

        c+=1;
    }

    printf("Hits in %d\n", args->hits_in);
    printf("Rays: %d\n", rays);
    answer = 4.0 * (double)(args->hits_in)/(double)(rays);

    //freeing everything!
    pthread_mutex_destroy(args->mutex);
    free(args->mutex);
    free(args);

    return answer;
}
A: 

Threading has a cost. It may be that, since your useful computation looks very simple, the cost of thread management (the cost paid on thread switches and on synchronisation) is much higher than the benefit.

kriss
Yes, but although it is simple, it does a fair amount of work depending on the number of rays I set it to, so theoretically threading should pay off, should it not? I'm not really sure myself, still figuring out whether that holds true! Thanks for the comment anyway.
kellpossible
@kellpossible: you tell us. At best it will depend on the value of rays. At worst the different threads will be scheduled on the same core... For code as simple as this I would expect any benefit (if any) to come only at quite a high value of rays (maybe thousands). But you do not provide that information. With which value of rays did you try your code?
kriss
10**8 rays :). With the help from the guys above I'm starting to see benefits at the larger numbers.
kellpossible
@kellpossible: cool. Some feedback telling us at which number of rays you see a benefit would be nice (maybe summarize the new results after the improvements at the end of the question).
kriss
Good idea, I'll do it first thing tomorrow.
kellpossible
+6  A: 

There are a couple of problems I can see:

  • rand() is not thread-safe. Use drand48_r() instead (it natively generates a double in the range [0.0, 1.0), which is what you want).
  • You only create one struct arguments structure and then try to use it for multiple threads. You need to create a separate one for each thread (just use an array).

Here's how I'd clean up your approach. Note how we don't need any mutexes - each thread just stashes its own result in a separate location, and the main thread adds them up after the other threads have finished:

#define _GNU_SOURCE   /* exposes drand48_r()/srand48_r() even under strict -std modes */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include <pthread.h>

struct thread_info {
    int thread_n;
    pthread_t thread_id;
    int rays;
    int hits_in;
};

void seed_rand(int thread_n, struct drand48_data *buffer)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    srand48_r(tv.tv_sec * thread_n + tv.tv_usec, buffer);
}

void *threadFunc(void *arg)
{
    struct thread_info *thread_info = arg;
    struct drand48_data drand_buffer;

    int n = 0;
    const int rays = thread_info->rays;
    int hits_in = 0;
    double x;
    double y;
    double r;

    seed_rand(thread_info->thread_n, &drand_buffer);

    for (n = 0; n < rays; n++)
    {
        drand48_r(&drand_buffer, &x);
        drand48_r(&drand_buffer, &y);
        r = x * x + y * y;
        if (r < 1.0){
            hits_in++;
        }
    }

    thread_info->hits_in = hits_in;
    return NULL;
}


double calculate_pi_mt(int rays, int threads)
{
    int c;
    int hits_in = 0;

    if (rays % threads) {
        printf("Error: number of rays is not evenly divisible by threads\n");
        rays = (rays / threads) * threads;
    }

    /* argument initialization */
    struct thread_info *thr = malloc(threads * sizeof thr[0]);

    for (c = 0; c < threads; c++) {
        thr[c].thread_n = c;
        thr[c].rays = rays / threads;
        thr[c].hits_in = 0;
        if (pthread_create(&thr[c].thread_id, NULL, threadFunc, &thr[c])) {
            printf("Error when creating thread\n");
        }
        printf("Created Thread: %d\n", thr[c].thread_n);
    }

    for (c = 0; c < threads; c++) {
        printf("main waiting for thread %d to terminate...\n", c);
        if (pthread_join(thr[c].thread_id, NULL)) {
            printf("Error while waiting for thread to join\n");
        }
        hits_in += thr[c].hits_in;
        printf("Destroyed Thread: %d\n", c+1);
    }

    printf("Hits in %d\n", hits_in);
    printf("Rays: %d\n", rays);
    double answer = (4.0 * hits_in) / rays;

    free(thr);

    return answer;
}
caf
Yes! That rand() is probably the problem! And thanks for the tip about the range of the drand48_r() function - it took me a little while to figure out how to get rand() to produce results in that range! About the struct arguments... if it is only used once at the start of the thread and once at the end, will it still affect performance?
kellpossible
@kellpossible: The argument thing is a correctness issue. Your speed issue is the amount of locking and unlocking, which is completely unnecessary.
caf
Thanks caf, I'll test this solution... Hopefully it works and I can press the tick button... and hopefully I've learned something from your advice!
kellpossible
Thanks a lot caf! Now it actually runs marginally faster on 2 threads with 10**8 rays instead of being 3 times slower. With the other optimizations suggested, it runs like a breeze!
kellpossible
Well, I have quantified the advantage now: it takes 48 seconds to run instead of 50 when set to 2 threads with a certain number of rays. Is this performance gain smaller than should be expected?
kellpossible
Wow! By temporarily using local variables instead of pointers to the thread-info struct the calculation time for 2 threads halves, but single thread performance remains more or less the same. How strange!
kellpossible
@kellpossible: The `struct thread_info` is so small (16 bytes on x86) that multiple copies of it fit into one cacheline (usually 64 bytes). This means that that cacheline is "ping-ponging" between each CPU core's cache. You could pad out the `struct thread_info` to a full cacheline - or just use local variables during the calculation as you've done. I've updated the answer accordingly.
caf
I suspected something along these lines, for what else could explain the slowdown occurring only with multiple threads :). Please excuse my ignorance, but aren't there two copies of the thread_info struct, one for each thread (which could be stored separately, one in each cache)? Or is there some piece of data that both copies depend on (like types or something weird like that)? Also, about padding the struct... can I do this by adding more variables to increase its size to 64 bytes, or can the compiler do it for me? Thanks for your experienced input anyway caf!
kellpossible
Hmm, padding a struct almost needs a whole separate question!
kellpossible
@kellpossible: Caches are arranged in "cachelines", which are blocks of memory that are moved in and out of the cache together - typically a cacheline is 64 bytes. There *are* two `struct thread_info` - but since they are only 16 bytes long and are next to each other in memory, it is likely that they are both within a single cacheline. This means that you *can't* have one in each cache - the entire cacheline containing both gets bounced between the caches each time one of them is written to.
caf
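For illustration (this sketch is not part of caf's answer): one way to pad the per-thread struct so that each instance occupies its own cacheline is GCC's `aligned` attribute, assuming a 64-byte cacheline:

/* Illustrative sketch, assuming a 64-byte cacheline and GCC/Clang.
   The aligned attribute rounds sizeof(struct thread_info) up to a
   multiple of 64 and aligns every array element on a 64-byte boundary,
   so the two threads' structs no longer share a cacheline. */
struct thread_info {
    int thread_n;
    pthread_t thread_id;
    int rays;
    int hits_in;
} __attribute__((aligned(64)));

Since `malloc` only guarantees a modest alignment, allocating the array with `posix_memalign(&p, 64, threads * sizeof(struct thread_info))` would additionally place every element exactly on a cacheline boundary.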
+7  A: 

You're using far too many synchronization primitives. You should sum the local_hits at the end in the main thread, rather than using a mutex to update a shared counter asynchronously. Or, at the very least, you could use an atomic operation (it's just an int) instead of locking an entire mutex to update one int.
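For illustration (a sketch, not part of this answer): with GCC's atomic builtins the per-thread total can be folded into the shared counter without a mutex:

/* Sketch only: replaces the mutex-protected update at the end of threadFunc.
 * __sync_fetch_and_add() is a GCC/Clang builtin that performs an atomic
 * read-modify-write on the int; C11's atomic_fetch_add() would do the same. */
__sync_fetch_and_add(&args->hits_in, local_hits_in);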

DeadMG
Thanks for the quick reply! My first time here; hopefully I'll get a chance to answer others. I have tried taking the lock off (it was there as a precaution, for I know little about pthreads at the moment!), but it seems to make no difference. I'm not sure what you mean by summing local_hits at the end in the main thread... as far as I can see, each thread transfers its local_hits to the main total (that the answer is derived from) only once, and only when it has finished. I'm not sure what an atomic operation is, but do you mean using a "compare and swap" as Wikipedia suggests?
kellpossible
Make an array of local_hits on the caller's stack. Pass a pointer into it to each thread function. Have each thread use that int as its local_hits. When all threads have finished, main sums the array. Edit: You also need to tell us what value of threads you are using.
DeadMG
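A minimal, self-contained sketch of that suggestion (illustrative names, not DeadMG's code; rand_r() is used only to keep the example thread-safe, and each thread accumulates locally and writes its slot once at the end to sidestep the cacheline issue discussed above):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <pthread.h>

#define MAXTHREADS 16
#define RAYS_PER_THREAD (10 * 1000 * 1000)   /* illustrative constant */

static void *count_hits(void *arg)
{
    int *slot = arg;                  /* this thread's private result slot */
    unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)(uintptr_t)slot;
    int local_hits = 0;

    for (int n = 0; n < RAYS_PER_THREAD; n++) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y < 1.0)
            local_hits++;
    }
    *slot = local_hits;               /* single write at the very end */
    return NULL;
}

int main(void)
{
    int threads = 2;
    pthread_t ids[MAXTHREADS];
    int hits[MAXTHREADS] = {0};       /* one slot per thread, on the caller's stack */
    long total = 0;

    for (int c = 0; c < threads; c++)
        pthread_create(&ids[c], NULL, count_hits, &hits[c]);
    for (int c = 0; c < threads; c++) {
        pthread_join(ids[c], NULL);
        total += hits[c];             /* main thread sums the array */
    }
    printf("pi ~= %f\n", 4.0 * (double)total / ((double)threads * RAYS_PER_THREAD));
    return 0;
}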