



The following program is essentially the same as the one described here. When I compile and run the program with two threads (NTHREADS == 2), I get the following run times:

real        0m14.120s
user        0m25.570s
sys         0m0.050s

When it is run with just one thread (NTHREADS == 1), the run times are significantly better, even though it uses only one core.

real        0m4.705s
user        0m4.660s
sys         0m0.010s

My system is dual core, I know random_r is thread safe, and I am pretty sure it is non-blocking. When the same program is run without random_r, with a calculation of cosines and sines as a replacement, the dual-threaded version runs in about half the time, as expected.

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define NTHREADS 2
#define PRNG_BUFSZ 8
#define ITERATIONS 1000000000

void* thread_run(void* arg) {
    int r1, i, totalIterations = ITERATIONS / NTHREADS;
    for (i = 0; i < totalIterations; i++){
        random_r((struct random_data*)arg, &r1);
    }
    /* print the last value so the loop is not optimized away */
    printf("%i\n", r1);
    return NULL;
}

int main(int argc, char** argv) {
    struct random_data* rand_states = (struct random_data*)calloc(NTHREADS, sizeof(struct random_data));
    char* rand_statebufs = (char*)calloc(NTHREADS, PRNG_BUFSZ);
    pthread_t* thread_ids;
    int t = 0;
    thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
    /* create threads */
    for (t = 0; t < NTHREADS; t++) {
        initstate_r(random(), &rand_statebufs[t], PRNG_BUFSZ, &rand_states[t]);
        pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t]);
    }
    /* wait for all threads to finish */
    for (t = 0; t < NTHREADS; t++) {
        pthread_join(thread_ids[t], NULL);
    }
    free(thread_ids);
    free(rand_states);
    free(rand_statebufs);
    return 0;
}

I am confused about why, when generating random numbers, the two-threaded version performs much worse than the single-threaded version, considering random_r is meant to be used in multi-threaded applications.

+8  A: 

A very simple change to space the data out in memory:

struct random_data* rand_states = (struct random_data*)calloc(NTHREADS * 64, sizeof(struct random_data));
char* rand_statebufs = (char*)calloc(NTHREADS*64, PRNG_BUFSZ);
pthread_t* thread_ids;
int t = 0;
thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
/* create threads */
for (t = 0; t < NTHREADS; t++) {
    /* each thread's state and buffer now sit 64 elements apart */
    initstate_r(random(), &rand_statebufs[t*64], PRNG_BUFSZ, &rand_states[t*64]);
    pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t*64]);
}

results in a much faster running time on my dual-core machine.

This would confirm the suspicion it was meant to test: you are mutating values on the same cache line from two separate threads, and so have cache contention (commonly called false sharing). Herb Sutter's talk is worth watching if you have the time and don't know about this yet.
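To make the layout issue concrete, here is a minimal sketch (not the code from the question) contrasting two counters packed next to each other with two counters forced onto separate cache lines. The 64-byte line size, the struct names, and the helpers `bump` and `run_pair` are assumptions made up for illustration; `_Alignas` needs a C11 compiler. Timing is deliberately left out because the numbers vary by machine.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64   /* assumed; common on x86-64, not guaranteed */
#define INCREMENTS 1000000

/* Packed layout: both counters will almost certainly share a cache line. */
struct packed_counters {
    volatile uint64_t a;
    volatile uint64_t b;
};

/* Padded layout: each counter starts on its own cache line. */
struct padded_counters {
    _Alignas(CACHE_LINE_SIZE) volatile uint64_t a;
    _Alignas(CACHE_LINE_SIZE) volatile uint64_t b;
};

/* Thread body: repeatedly writes to one counter that no other thread touches. */
void* bump(void* arg) {
    volatile uint64_t* counter = arg;
    assert(counter != NULL);
    for (int i = 0; i < INCREMENTS; i++) {
        (*counter)++;
    }
    return NULL;
}

/* Runs two threads, each incrementing only its own counter.
   Returns 0 on success, -1 if a thread could not be created. */
int run_pair(volatile uint64_t* first, volatile uint64_t* second) {
    pthread_t t1, t2;
    if (pthread_create(&t1, NULL, bump, (void*)first) != 0) {
        return -1;
    }
    if (pthread_create(&t2, NULL, bump, (void*)second) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Calling `run_pair(&p.a, &p.b)` on a `struct packed_counters p` and then on a `struct padded_counters` gives the same final counts either way; only the padded layout keeps the two writers from invalidating each other's cache line.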

Work out your cache line size, and create each thread's data so it is aligned to it.
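One way to work out the line size at run time on Linux is glibc's `sysconf` with `_SC_LEVEL1_DCACHE_LINESIZE` (the same value `getconf LEVEL1_DCACHE_LINESIZE` reports from the shell). This is a glibc extension rather than POSIX, so the sketch below falls back to an assumed 64 bytes when the query is unavailable or returns nothing useful; the function name `cache_line_size` is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define FALLBACK_CACHE_LINE_SIZE 64   /* assumption: typical for x86-64 */

/* Returns the L1 data cache line size in bytes, or the fallback value
   if the system does not report one. */
size_t cache_line_size(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long reported = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (reported > 0) {
        return (size_t)reported;
    }
#endif
    return FALLBACK_CACHE_LINE_SIZE;
}
```

A run-time value works as the alignment argument to posix_memalign, but a struct padded with a fixed-size array still needs a compile-time constant, which is why a `#define` is the usual compromise.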

It's a bit cleaner to plonk all the thread's data into a struct, and align that:

#include <string.h>   /* memset */

#define CACHE_LINE_SIZE 64

/* One cache line per thread: PRNG state, its buffer, then padding. */
struct thread_data {
    struct random_data random_data;
    char statebuf[PRNG_BUFSZ];
    char padding[CACHE_LINE_SIZE - sizeof ( struct random_data )-PRNG_BUFSZ];
};

int main ( int argc, char** argv )
{
    printf ( "%zu\n", sizeof ( struct thread_data ) );

    void* apointer;

    /* align to the cache line; posix_memalign needs a power-of-two alignment */
    if ( posix_memalign ( &apointer, CACHE_LINE_SIZE, NTHREADS * sizeof ( struct thread_data ) ) )
        exit ( 1 );

    struct thread_data* thread_states = apointer;

    memset ( apointer, 0, NTHREADS * sizeof ( struct thread_data ) );

    pthread_t* thread_ids;

    int t = 0;

    thread_ids = ( pthread_t* ) calloc ( NTHREADS, sizeof ( pthread_t ) );

    /* create threads */
    for ( t = 0; t < NTHREADS; t++ ) {
        initstate_r ( random(), thread_states[t].statebuf, PRNG_BUFSZ, &thread_states[t].random_data );
        pthread_create ( &thread_ids[t], NULL, &thread_run, &thread_states[t].random_data );
    }

    for ( t = 0; t < NTHREADS; t++ ) {
        pthread_join ( thread_ids[t], NULL );
    }

    free ( thread_ids );
    free ( thread_states );

    return 0;
}


refugio:$ gcc -O3 -o bin/nixuz_random_r src/nixuz_random_r.c -lpthread
refugio:$ time bin/nixuz_random_r 

real    0m1.278s
user    0m2.540s
sys 0m0.000s

Alternatively, pad each thread's data to double the cache line size and use plain malloc. The extra padding ensures the mutated memory lands on separate lines, even though malloc is 16-byte (IIRC) rather than 64-byte aligned.
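A rough sketch of that malloc variant follows, assuming a 64-byte line and glibc's struct random_data. The names `padded_thread_data`, `alloc_thread_data`, `cache_line_of`, and `hot_data_separated` are invented for this example. The point is that the hot bytes of neighbouring elements end up more than one line apart, so they cannot share a line wherever the allocation happens to start.

```c
#define _GNU_SOURCE          /* exposes struct random_data in glibc's <stdlib.h> */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64   /* assumed, as in the answer above */
#define PRNG_BUFSZ 8

/* Hot data first, then padding out to two full cache lines. */
struct padded_thread_data {
    struct random_data random_data;
    char statebuf[PRNG_BUFSZ];
    char padding[2 * CACHE_LINE_SIZE - sizeof(struct random_data) - PRNG_BUFSZ];
};

/* Allocates zeroed per-thread state with calloc, i.e. with no alignment
   guarantee beyond what malloc provides. Caller frees the result. */
struct padded_thread_data* alloc_thread_data(size_t nthreads) {
    return calloc(nthreads, sizeof(struct padded_thread_data));
}

/* Index of the cache line that contains the given address. */
uintptr_t cache_line_of(const void* p) {
    return (uintptr_t)p / CACHE_LINE_SIZE;
}

/* Returns 1 if the mutated bytes of elements i and i + 1 are on
   different cache lines, 0 otherwise. */
int hot_data_separated(const struct padded_thread_data* d, size_t i) {
    const void* last_hot = &d[i].statebuf[PRNG_BUFSZ - 1];
    const void* next_hot = &d[i + 1].random_data;
    assert(d != NULL);
    return cache_line_of(last_hot) < cache_line_of(next_hot);
}
```

The cost is memory: each thread's state takes two lines instead of one, which is negligible for a handful of threads.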

(The run above is this fast because I reduced ITERATIONS by a factor of ten, not because I have a stupidly fast machine.)

Pete Kirkham
Ugh. This can bite pretty much any small, dense structure that multiple threads are going to try writing to parts of, right?
Nicholas Knight
Thanks a million for your help, I would never have figured this out on my own. P.S. I moved rand_states and rand_statebufs into the thread and initialized the random number generator from there, which also nicely solves the cache problem in a very simple way.
@Nicholas: Yep. It pays to not be over-mean with memory. Mind you, packing your thread-local allocations together can help too. Thread-locals can be a stupendous win when done right since you can avoid so much cache contention and locking.
Donal Fellows
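The thread-local approach mentioned in the comments might look roughly like the sketch below. The poster's actual code is not shown, so this is a guess at the idea, not a reproduction: `worker_args`, `thread_run_local`, `run_workers`, and `MAX_WORKERS` are hypothetical names. Each thread keeps its random_data and state buffer on its own stack, so the generator state of different threads never shares a cache line. The state is zeroed first because glibc's initstate_r inspects the previous state pointer.

```c
#define _GNU_SOURCE          /* exposes random_r and friends in glibc */
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PRNG_BUFSZ 8
#define MAX_WORKERS 16       /* arbitrary cap for this sketch */

struct worker_args {
    unsigned int seed;       /* input: seed for this thread's generator */
    int iterations;          /* input: how many numbers to draw */
    int32_t last;            /* output: last value drawn, written once */
};

/* Thread body: all generator state lives on this thread's stack. */
void* thread_run_local(void* arg) {
    struct worker_args* args = arg;
    struct random_data state;
    _Alignas(int32_t) char statebuf[PRNG_BUFSZ];
    int32_t r = 0;

    assert(args != NULL);
    memset(&state, 0, sizeof state);       /* must be zeroed before initstate_r */
    memset(statebuf, 0, sizeof statebuf);
    initstate_r(args->seed, statebuf, PRNG_BUFSZ, &state);
    for (int i = 0; i < args->iterations; i++) {
        random_r(&state, &r);
    }
    args->last = r;                        /* single write to shared memory */
    return NULL;
}

/* Starts n workers and waits for them. Returns 0 on success, -1 on error. */
int run_workers(struct worker_args* args, int n) {
    pthread_t ids[MAX_WORKERS];
    int started = 0, rc = 0;
    if (n < 0 || n > MAX_WORKERS) {
        return -1;
    }
    for (; started < n; started++) {
        if (pthread_create(&ids[started], NULL, thread_run_local, &args[started]) != 0) {
            rc = -1;
            break;
        }
    }
    for (int t = 0; t < started; t++) {
        pthread_join(ids[t], NULL);
    }
    return rc;
}
```

The argument structs may still sit next to each other in memory, but each is written only once when its thread finishes, so there is no repeated invalidation inside the hot loop.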