ansaurus

Question

Multi-threaded random_r is slower than single threaded version.

Answer 1

+8 A:

A very simple change to space the data out in memory:

struct random_data* rand_states = (struct random_data*)calloc(NTHREADS * 64, sizeof(struct random_data));
char* rand_statebufs = (char*)calloc(NTHREADS*64, PRNG_BUFSZ);
pthread_t* thread_ids;
int t = 0;
thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
/* create threads */
for (t = 0; t < NTHREADS; t++) {
    initstate_r(random(), &rand_statebufs[t*64], PRNG_BUFSZ, &rand_states[t*64]);
    pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t*64]);
}

results in a much faster running time on my dual-core machine.

This would confirm the suspicion it was meant to test: that you are mutating values on the same cache line in two separate threads, and so have cache contention (false sharing). Herb Sutter's talk is worth watching if you have the time and don't know about this yet.
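To make the contention concrete, here is a small sketch (not part of the original code) of a helper that reports whether two objects touch a common cache line. The name shares_cache_line and the 64-byte line size are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size; 64 bytes is common on x86 but is not guaranteed. */
#define ASSUMED_LINE_SIZE 64

/* Hypothetical helper: true if the byte ranges [a, a+a_len) and
   [b, b+b_len) touch at least one common cache line of size `line`. */
static bool shares_cache_line(const void *a, size_t a_len,
                              const void *b, size_t b_len, size_t line)
{
    assert(line > 0 && a_len > 0 && b_len > 0);

    /* Index of the first and last cache line each range occupies. */
    uintptr_t a_first = (uintptr_t)a / line;
    uintptr_t a_last  = ((uintptr_t)a + a_len - 1) / line;
    uintptr_t b_first = (uintptr_t)b / line;
    uintptr_t b_last  = ((uintptr_t)b + b_len - 1) / line;

    /* The two line intervals overlap if neither ends before the other starts. */
    return a_first <= b_last && b_first <= a_last;
}
```

Calling it on two adjacent elements of a densely packed array whose element size is smaller than a line will generally return true, which is the situation the original layout was in.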

Work out your cache line size, and create each thread's data so it is aligned to it.
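One way to work it out at run time, as a rough sketch: glibc exposes the L1 data cache line size through sysconf as an extension. The call can return 0 or -1 where the value is unknown, so the sketch falls back to 64, which is an assumption and not a guarantee. The names cache_line_size and round_up_to_line are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Assumed fallback when the system cannot report a line size. */
#define FALLBACK_LINE_SIZE 64

/* Hypothetical helper: ask glibc for the L1 data cache line size. */
static size_t cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0)
        return (size_t)n;
#endif
    return FALLBACK_LINE_SIZE;
}

/* Hypothetical helper: round a per-thread data size up to a whole
   number of cache lines, so consecutive blocks start on line boundaries
   when the first block is itself line-aligned. */
static size_t round_up_to_line(size_t nbytes, size_t line)
{
    assert(line > 0);
    return (nbytes + line - 1) / line * line;
}
```

The result of round_up_to_line can be used as the per-thread stride when the size is only known at run time; a compile-time constant such as the CACHE_LINE_SIZE define is still needed for struct padding.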

It's a bit cleaner to plonk all the thread's data into a struct, and align that:

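#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* NTHREADS, PRNG_BUFSZ and thread_run are taken from the question's code. */
#define CACHE_LINE_SIZE 64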

struct thread_data {
    struct random_data random_data;
    char statebuf[PRNG_BUFSZ];
    /* only valid while sizeof ( struct random_data ) + PRNG_BUFSZ <= CACHE_LINE_SIZE */
    char padding[CACHE_LINE_SIZE - sizeof ( struct random_data )-PRNG_BUFSZ];
};

int main ( int argc, char** argv )
{
    printf ( "%zu\n", sizeof ( struct thread_data ) );

    void* apointer;

    if ( posix_memalign ( &apointer, CACHE_LINE_SIZE, NTHREADS * sizeof ( struct thread_data ) ) )
        exit ( 1 );

    struct thread_data* thread_states = apointer;

    memset ( apointer, 0, NTHREADS * sizeof ( struct thread_data ) );

    pthread_t* thread_ids;

    int t = 0;

    thread_ids = ( pthread_t* ) calloc ( NTHREADS, sizeof ( pthread_t ) );

    /* create threads */
    for ( t = 0; t < NTHREADS; t++ ) {
        initstate_r ( random(), thread_states[t].statebuf, PRNG_BUFSZ, &thread_states[t].random_data );
        pthread_create ( &thread_ids[t], NULL, &thread_run, &thread_states[t].random_data );
    }

    for ( t = 0; t < NTHREADS; t++ ) {
        pthread_join ( thread_ids[t], NULL );
    }

    free ( thread_ids );
    free ( thread_states );
}

with CACHE_LINE_SIZE 64:

refugio:$ gcc -O3 -o bin/nixuz_random_r src/nixuz_random_r.c -lpthread
refugio:$ time bin/nixuz_random_r 
64
63499495
944240966

real    0m1.278s
user    0m2.540s
sys     0m0.000s

Or you can pad to double the cache line size and allocate with malloc: the extra padding ensures the mutated memory ends up on separate lines even though malloc's alignment is 16 bytes (IIRC) rather than 64.
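A sketch of that double-padding idea, under stated assumptions: a 64-byte line, and a stand-in hot array in place of the random_data and state buffer. If the mutated part is at most one line long and each slot is two lines long, the gap between one slot's mutated bytes and the next slot's is at least a full line, so they cannot share a line whatever alignment malloc returns. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64  /* assumed line size */
#define HOT_BYTES 56        /* stand-in for the data each thread mutates */

static_assert(HOT_BYTES <= CACHE_LINE_SIZE, "hot data must fit in one line");

/* Each slot is two cache lines long; only `hot` is written by its thread. */
struct padded_slot {
    char hot[HOT_BYTES];
    char padding[2 * CACHE_LINE_SIZE - HOT_BYTES];
};

/* Returns 1 if no two neighbouring slots have hot bytes on a common line. */
static int hot_lines_disjoint(const struct padded_slot *slots, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++) {
        uintptr_t last_line_of_i =
            ((uintptr_t)slots[i].hot + HOT_BYTES - 1) / CACHE_LINE_SIZE;
        uintptr_t first_line_of_next =
            (uintptr_t)slots[i + 1].hot / CACHE_LINE_SIZE;
        if (last_line_of_i >= first_line_of_next)
            return 0;
    }
    return 1;
}
```

The cost is memory: every slot wastes more than a full line, which is the trade-off against using posix_memalign with single-line slots.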

(I reduced ITERATIONS by a factor of ten; I don't have a stupidly fast machine.)

Pete Kirkham 2010-06-08 19:30:44

Ugh. This can bite pretty much any small, dense structure that multiple threads are going to try writing to parts of, right?

Nicholas Knight 2010-06-08 20:03:28

Thanks a million for your help, I would never have figured this out on my own. P.S. I moved the rand_states and rand_statebufs into the thread and just initialized the random number generator from there, which also nicely solves the cache problem in a very simple way.

Nixuz 2010-06-08 20:06:38
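The comment above does not show the code, so the following is only a guess at the shape of that change: the generator state lives on the worker's own stack and is initialised inside the thread function. The worker_arg struct, the STATE_BYTES value and the seed field are assumptions for illustration; random_r and initstate_r are the glibc functions used in the question.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE     /* exposes random_r/initstate_r in glibc */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define STATE_BYTES 64      /* assumed state buffer size */

/* Hypothetical per-thread argument; `last` is written once at the end. */
struct worker_arg {
    unsigned int seed;
    long iterations;
    int32_t last;
};

/* Thread body with the pthread start-routine signature. */
static void *thread_run(void *p)
{
    struct worker_arg *arg = p;
    struct random_data state;   /* on this thread's stack */
    char statebuf[STATE_BYTES]; /* likewise */
    int32_t value = 0;

    assert(arg != NULL);

    /* glibc expects the random_data struct to be zeroed before initstate_r. */
    memset(&state, 0, sizeof state);
    memset(statebuf, 0, sizeof statebuf);
    if (initstate_r(arg->seed, statebuf, sizeof statebuf, &state) != 0)
        return NULL;

    for (long i = 0; i < arg->iterations; i++)
        random_r(&state, &value);

    arg->last = value;          /* single write to shared memory */
    return NULL;
}
```

Each thread would be started with pthread_create and its own worker_arg, with the seed drawn in the main thread (for example from random(), as in the answer). The hot state then sits on separate stacks, and the only shared write is the final store to last.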

@Nicholas: Yep. It pays to not be over-mean with memory. Mind you, packing your thread-local allocations together can help too. Thread-locals can be a stupendous win when done right since you can avoid so much cache contention and locking.

Donal Fellows 2010-06-08 20:13:34
