I wanted to know which is better/faster to use: POSIX calls like pthread_once() and sem_wait(), or the dispatch_* functions. So I created a little test, and I am surprised at the results (questions and results are at the end).

In the test code I am using mach_absolute_time() to time the calls. I really don't care that its units do not map exactly to nanoseconds: I am comparing the values with each other, so the exact time units don't matter; only the relative differences between the intervals do. The numbers in the results section are repeatable but not averaged. I could have averaged the times, but I am not looking for exact numbers.
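If the tick counts ever need to be compared across machines, they have to be converted to a common unit first, since the tick length of mach_absolute_time() is hardware dependent (on Intel Macs it is commonly one nanosecond, but that is not guaranteed elsewhere). A minimal sketch of such a helper follows. The names now_ns(), time_loop() and empty_work() are hypothetical and not part of the test above, and the clock_gettime() branch is only an assumption so the sketch also builds off Apple platforms:

```c
#ifndef __APPLE__
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() off Apple platforms */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>
#ifdef __APPLE__
#include <mach/mach_time.h>
#endif

/* Monotonic timestamp in nanoseconds (hypothetical helper). */
uint64_t now_ns(void)
{
#ifdef __APPLE__
    /* mach_absolute_time() counts ticks; numer/denom scales ticks to ns. */
    mach_timebase_info_data_t tb;
    mach_timebase_info(&tb);
    return mach_absolute_time() * tb.numer / tb.denom;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical stand-in for the call being measured. */
void empty_work(void)
{
}

/* Calls fn() `iterations` times and returns the elapsed time in ns. */
uint64_t time_loop(void (*fn)(void), int iterations)
{
    assert(fn != 0 && iterations > 0);
    uint64_t start = now_ns();
    for (int i = 0; i < iterations; i++)
        fn();
    return now_ns() - start;
}
```

Calling time_loop(empty_work, 100000) then yields a duration in nanoseconds on either kind of platform, which makes per-device results directly comparable.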

test.m (simple console application; easy to compile):

#import <Foundation/Foundation.h>
#import <dispatch/dispatch.h>
#include <errno.h>      // errno, EINVAL
#include <fcntl.h>      // O_CREAT
#include <semaphore.h>
#include <pthread.h>
#include <time.h>
#include <mach/mach_time.h>

// *sigh* OSX does not have pthread_barrier (you can ignore the pthread_barrier 
// code, the interesting stuff is lower)
typedef int pthread_barrierattr_t;
typedef struct
{
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int count;
    int tripCount;
} pthread_barrier_t;


int pthread_barrier_init(pthread_barrier_t *barrier, const pthread_barrierattr_t *attr, unsigned int count)
{
    if(count == 0)
    {
        errno = EINVAL;
        return -1;
    }
    // pthread_*_init() return 0 on success and a positive error number on failure
    if(pthread_mutex_init(&barrier->mutex, 0) != 0)
    {
        return -1;
    }
    if(pthread_cond_init(&barrier->cond, 0) != 0)
    {
        pthread_mutex_destroy(&barrier->mutex);
        return -1;
    }
    barrier->tripCount = count;
    barrier->count = 0;

    return 0;
}

int pthread_barrier_destroy(pthread_barrier_t *barrier)
{
    pthread_cond_destroy(&barrier->cond);
    pthread_mutex_destroy(&barrier->mutex);
    return 0;
}

int pthread_barrier_wait(pthread_barrier_t *barrier)
{
    pthread_mutex_lock(&barrier->mutex);
    ++(barrier->count);
    if(barrier->count >= barrier->tripCount)
    {
        barrier->count = 0;
        pthread_cond_broadcast(&barrier->cond);
        pthread_mutex_unlock(&barrier->mutex);
        return 1;
    }
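    else
    {
        // note: there is no predicate loop here, so a spurious wakeup could
        // release a waiter early; good enough for this test
        pthread_cond_wait(&barrier->cond, &(barrier->mutex));
        pthread_mutex_unlock(&barrier->mutex);
        return 0;
    }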
}

//
// ok you can start paying attention now...
//

void onceFunction(void)
{
}

@interface SemaphoreTester : NSObject
{
    sem_t *sem1;
    sem_t *sem2;
    pthread_barrier_t *startBarrier;
    pthread_barrier_t *finishBarrier;
}
@property (nonatomic, assign) sem_t *sem1;
@property (nonatomic, assign) sem_t *sem2;
@property (nonatomic, assign) pthread_barrier_t *startBarrier;
@property (nonatomic, assign) pthread_barrier_t *finishBarrier;
@end
@implementation SemaphoreTester
@synthesize sem1, sem2, startBarrier, finishBarrier;
- (void)thread1
{
    pthread_barrier_wait(startBarrier);
    for(int i = 0; i < 100000; i++)
    {
        sem_wait(sem1);
        sem_post(sem2);
    }
    pthread_barrier_wait(finishBarrier);
}

- (void)thread2
{
    pthread_barrier_wait(startBarrier);
    for(int i = 0; i < 100000; i++)
    {
        sem_wait(sem2);
        sem_post(sem1);
    }
    pthread_barrier_wait(finishBarrier);
}
@end


int main (int argc, const char * argv[]) 
{
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    // mach_absolute_time() returns an unsigned 64-bit tick count
    uint64_t start;
    uint64_t stop;

    // semaphore non contention test
    {
        // grrr, OSX doesn't have sem_init
        sem_t *sem1 = sem_open("sem1", O_CREAT, 0777, 0);

        start = mach_absolute_time();
        for(int i = 0; i < 100000; i++)
        {
            sem_post(sem1);
            sem_wait(sem1);
        }
        stop = mach_absolute_time();
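        sem_close(sem1);
        // named semaphores outlive the process, so remove the name again
        sem_unlink("sem1");

        NSLog(@"0 Contention time                         = %lld", (long long)(stop - start));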
    }

    // semaphore contention test
    {
        __block sem_t *sem1 = sem_open("sem1", O_CREAT, 0777, 0);
        __block sem_t *sem2 = sem_open("sem2", O_CREAT, 0777, 0);
        __block pthread_barrier_t startBarrier;
        pthread_barrier_init(&startBarrier, NULL, 3);
        __block pthread_barrier_t finishBarrier;
        pthread_barrier_init(&finishBarrier, NULL, 3);

        dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_LOW, 0);
        dispatch_async(queue, ^{
            pthread_barrier_wait(&startBarrier);
            for(int i = 0; i < 100000; i++)
            {
                sem_wait(sem1);
                sem_post(sem2);
            }
            pthread_barrier_wait(&finishBarrier);
        });
        dispatch_async(queue, ^{
            pthread_barrier_wait(&startBarrier);
            for(int i = 0; i < 100000; i++)
            {
                sem_wait(sem2);
                sem_post(sem1);
            }
            pthread_barrier_wait(&finishBarrier);
        });
        pthread_barrier_wait(&startBarrier);
        // start timing, everyone hit this point
        start = mach_absolute_time();
        // kick it off
        sem_post(sem2);
        pthread_barrier_wait(&finishBarrier);
        // stop timing, everyone hit the finish point
        stop = mach_absolute_time();
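        sem_close(sem1);
        sem_close(sem2);
        // named semaphores outlive the process, so remove the names again
        sem_unlink("sem1");
        sem_unlink("sem2");
        NSLog(@"2 Threads always contenting time          = %lld", (long long)(stop - start));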
        pthread_barrier_destroy(&startBarrier);
        pthread_barrier_destroy(&finishBarrier);
    }   

    // NSThread semaphore contention test
    {
        sem_t *sem1 = sem_open("sem1", O_CREAT, 0777, 0);
        sem_t *sem2 = sem_open("sem2", O_CREAT, 0777, 0);
        pthread_barrier_t startBarrier;
        pthread_barrier_init(&startBarrier, NULL, 3);
        pthread_barrier_t finishBarrier;
        pthread_barrier_init(&finishBarrier, NULL, 3);

        SemaphoreTester *tester = [[[SemaphoreTester alloc] init] autorelease];
        tester.sem1 = sem1;
        tester.sem2 = sem2;
        tester.startBarrier = &startBarrier;
        tester.finishBarrier = &finishBarrier;
        [NSThread detachNewThreadSelector:@selector(thread1) toTarget:tester withObject:nil];
        [NSThread detachNewThreadSelector:@selector(thread2) toTarget:tester withObject:nil];
        pthread_barrier_wait(&startBarrier);
        // start timing, everyone hit this point
        start = mach_absolute_time();
        // kick it off
        sem_post(sem2);
        pthread_barrier_wait(&finishBarrier);
        // stop timing, everyone hit the finish point
        stop = mach_absolute_time();
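        sem_close(sem1);
        sem_close(sem2);
        // named semaphores outlive the process, so remove the names again
        sem_unlink("sem1");
        sem_unlink("sem2");
        NSLog(@"2 NSTasks always contenting time          = %lld", (long long)(stop - start));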
        pthread_barrier_destroy(&startBarrier);
        pthread_barrier_destroy(&finishBarrier);
    }   

    // dispatch_semaphore non contention test
    {
        dispatch_semaphore_t sem1 = dispatch_semaphore_create(0);

        start = mach_absolute_time();
        for(int i = 0; i < 100000; i++)
        {
            dispatch_semaphore_signal(sem1);
            dispatch_semaphore_wait(sem1, DISPATCH_TIME_FOREVER);
        }
        stop = mach_absolute_time();

        NSLog(@"Dispatch 0 Contention time                = %lld", (long long)(stop - start));
    }


    // dispatch_semaphore contention test
    {   
        __block dispatch_semaphore_t sem1 = dispatch_semaphore_create(0);
        __block dispatch_semaphore_t sem2 = dispatch_semaphore_create(0);
        __block pthread_barrier_t startBarrier;
        pthread_barrier_init(&startBarrier, NULL, 3);
        __block pthread_barrier_t finishBarrier;
        pthread_barrier_init(&finishBarrier, NULL, 3);

        dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_LOW, 0);
        dispatch_async(queue, ^{
            pthread_barrier_wait(&startBarrier);
            for(int i = 0; i < 100000; i++)
            {
                dispatch_semaphore_wait(sem1, DISPATCH_TIME_FOREVER);
                dispatch_semaphore_signal(sem2);
            }
            pthread_barrier_wait(&finishBarrier);
        });
        dispatch_async(queue, ^{
            pthread_barrier_wait(&startBarrier);
            for(int i = 0; i < 100000; i++)
            {
                dispatch_semaphore_wait(sem2, DISPATCH_TIME_FOREVER);
                dispatch_semaphore_signal(sem1);
            }
            pthread_barrier_wait(&finishBarrier);
        });
        pthread_barrier_wait(&startBarrier);
        // start timing, everyone hit this point
        start = mach_absolute_time();
        // kick it off
        dispatch_semaphore_signal(sem2);
        pthread_barrier_wait(&finishBarrier);
        // stop timing, everyone hit the finish point
        stop = mach_absolute_time();

        NSLog(@"Dispatch 2 Threads always contenting time = %lld", (long long)(stop - start));
        pthread_barrier_destroy(&startBarrier);
        pthread_barrier_destroy(&finishBarrier);
    }   

    // pthread_once time
    {
        // POSIX leaves pthread_once() undefined for a control variable with automatic storage
        static pthread_once_t once = PTHREAD_ONCE_INIT;
        start = mach_absolute_time();
        for(int i = 0; i <100000; i++)
        {
            pthread_once(&once, onceFunction);
        }
        stop = mach_absolute_time();

        NSLog(@"pthread_once time  = %lld", (long long)(stop - start));
    }

    // dispatch_once time
    {
        // the dispatch_once predicate must have static or global storage
        static dispatch_once_t once;
        start = mach_absolute_time();
        for(int i = 0; i <100000; i++)
        {
            dispatch_once(&once, ^{});
        }
        stop = mach_absolute_time();

        NSLog(@"dispatch_once time = %lld", (long long)(stop - start));
    }

    [pool drain];
    return 0;
}

On my iMac (Snow Leopard Server 10.6.4):

  Model Identifier: iMac7,1
  Processor Name:   Intel Core 2 Duo
  Processor Speed:  2.4 GHz
  Number Of Processors: 1
  Total Number Of Cores:    2
  L2 Cache: 4 MB
  Memory:   4 GB
  Bus Speed:    800 MHz

I get:

0 Contention time                         =    101410439
2 Threads always contenting time          =    109748686
2 NSTasks always contenting time          =    113225207
0 Contention named semaphore time         =    166061832
2 Threads named semaphore contention time =    203913476
2 NSTasks named semaphore contention time =    204988744
Dispatch 0 Contention time                =      3411439
Dispatch 2 Threads always contenting time =    708073977
pthread_once time  =      2707770
dispatch_once time =        87433

On my MacBook Pro (Snow Leopard 10.6.4):

  Model Identifier: MacBookPro6,2
  Processor Name:   Intel Core i5
  Processor Speed:  2.4 GHz
  Number Of Processors: 1
  Total Number Of Cores:    2 (though HT is enabled)
  L2 Cache (per core):  256 KB
  L3 Cache: 3 MB
  Memory:   8 GB
  Processor Interconnect Speed: 4.8 GT/s

I got:

0 Contention time                         =     74172042
2 Threads always contenting time          =     82975742
2 NSTasks always contenting time          =     82996716
0 Contention named semaphore time         =    106772641
2 Threads named semaphore contention time =    162761973
2 NSTasks named semaphore contention time =    162919844
Dispatch 0 Contention time                =      1634941
Dispatch 2 Threads always contenting time =    759753865
pthread_once time  =      1516787
dispatch_once time =       120778

On an iPhone 3GS (iOS 4.0.2) I got:


0 Contention time                         =      5971929
2 Threads always contenting time          =     11989710
2 NSTasks always contenting time          =     11950564
0 Contention named semaphore time         =     16721876
2 Threads named semaphore contention time =     35333045
2 NSTasks named semaphore contention time =     35296579
Dispatch 0 Contention time                =       151909
Dispatch 2 Threads always contenting time =     46946548
pthread_once time  =       193592
dispatch_once time =        25071

Questions and statements:

  • sem_wait() and sem_post() are slow when not under contention.
    • Why is this the case?
    • Does OS X not care about compatible APIs? Is there some legacy code that forces this to be slow?
    • Why aren't these numbers the same as those for the dispatch_semaphore functions?
  • sem_wait() and sem_post() are just as slow when under contention as when they are not. There is a difference, but I expected a huge one between contended and uncontended, with numbers like those from the dispatch_semaphore code.
  • sem_wait() and sem_post() are slower when using named semaphores.
    • Why? Is it because the semaphore has to be synced between processes? Maybe there is more baggage when doing that.
  • dispatch_semaphore_wait() and dispatch_semaphore_signal() are crazy fast when not under contention (no surprise here, since Apple is touting this a lot).
  • dispatch_semaphore_wait() and dispatch_semaphore_signal() are 3x slower than sem_wait() and sem_post() when under contention.
    • Why is this so slow? It does not make sense to me. I would have expected this to be on par with sem_t under contention.
  • dispatch_once() is faster than pthread_once(), around 10x. Why? The only thing I can tell from the headers is that dispatch_once() avoids the function call overhead that pthread_once() incurs.

Motivation: I am presented with 2 sets of tools to get the job done for semaphores or once calls (I actually found other semaphore variants in the meantime, but I will ignore those unless brought up as a better option). I just want to know which is the best tool for the job (if I have the option of driving a screw with a Phillips or a flathead, I would choose the Phillips if I don't have to torque the screw and the flathead if I do).
It seems that if I start writing utilities with libdispatch I might not be able to port them to other operating systems that do not have libdispatch working yet... but it is so enticing to use ;)

As it stands: I will be using libdispatch when I don't have to worry about portability and POSIX calls when I do.

Thanks!

+1  A: 

sem_wait() and sem_post() are heavyweight synchronization facilities that can be used between processes. They always involve round trips to the kernel, and probably always require your thread to be rescheduled. They are generally not the right choice for in-process synchronization. I'm not sure why the named variants would be slower than the anonymous ones...

Mac OS X is actually pretty good about POSIX compatibility... But the POSIX specifications have a lot of optional functions, and the Mac doesn't have them all. Your post is actually the first I've ever heard of pthread_barriers, so I'm guessing they're either relatively recent or not all that common. (I haven't paid much attention to the evolution of pthreads for the past ten years or so.)

The reason the dispatch stuff falls apart under forced extreme contention is probably because under the covers the behavior is similar to spin locks. Your dispatch worker threads are very likely wasting a good chunk of their quanta under the optimistic assumption that the resource under contention is going to be available any cycle now... A bit of time with Shark would tell you for sure. The take-home point, though, should be that "optimizing" the thrashing during contention is a poor investment of programmer time. Instead spend the time optimizing the code to avoid heavy contention in the first place.

If you really have a resource that is an unavoidable bottleneck within your process, putting a semaphore around it is massively sub-optimal. Put it on its own serial dispatch queue, and as much as possible use dispatch_async to submit blocks to be executed on that queue.

Finally, dispatch_once() is faster than pthread_once() because it's spec'd and implemented to be fast on current processors. Probably Apple could speed up the pthread_once() implementation, as I suspect the reference implementation uses pthread synchronization primitives, but... well... they've provided all of the libdispatch goodness instead. :-)

Kaelin Colclasure
Good point about sem_wait/post, since dispatch_semaphores do not have to deal with the context switch to the kernel (it seems like a duh! now ;). I was responsible for adding POSIX compatibility to a homegrown kernel for embedded systems and have found barriers useful for creating unit tests. I was not trying to optimize one particular situation over another, but rather trying to figure out which of the 2 tools I am given is the best tool for the job (if I have the option of driving a screw with a Phillips or a flathead... I will use the Phillips). Updating question with motivation...
Brent Priddy
pthread_once() can/should be implemented with atomics for the "called" check. When waiting for the call to complete, yes, I agree it would use a pthread_mutex to block other threads (that is how I did it). In this test case, though, there is no blocking and no need for a kernel context switch. With that said, I still don't understand why there is a 10x difference. I guess libdispatch is optimized more than the POSIX calls.
Brent Priddy
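The design described in that comment (an atomic "called" check on the fast path, a mutex only while the initializer may still be running) can be sketched in portable C11 as below. This is an illustration under stated assumptions, not Apple's pthread_once() or dispatch_once() implementation; fast_once_t, FAST_ONCE_INIT, fast_once(), init_calls and count_init() are hypothetical names:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_int done;        /* 0 = not run yet, 1 = initializer finished */
    pthread_mutex_t lock;   /* taken only on the slow path */
} fast_once_t;

#define FAST_ONCE_INIT { 0, PTHREAD_MUTEX_INITIALIZER }

void fast_once(fast_once_t *once, void (*init)(void))
{
    assert(once != NULL && init != NULL);

    /* Fast path: a single acquire load, no lock and no system call. */
    if (atomic_load_explicit(&once->done, memory_order_acquire))
        return;

    /* Slow path: serialize callers until the initializer has finished. */
    pthread_mutex_lock(&once->lock);
    if (!atomic_load_explicit(&once->done, memory_order_relaxed)) {
        init();
        /* Release store publishes the initializer's writes to fast-path readers. */
        atomic_store_explicit(&once->done, 1, memory_order_release);
    }
    pthread_mutex_unlock(&once->lock);
}

/* Hypothetical initializer that counts how often it runs. */
int init_calls = 0;

void count_init(void)
{
    init_calls++;
}
```

After the first call, every later fast_once(&once, count_init) returns from the fast path, so a loop like the one in the test above would mostly measure the cost of one load plus the function call itself.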
Another observation: anonymous semaphores should operate like the dispatch_semaphores, since you can't retrieve them from another process. You should only have to context switch to the kernel when you actually have to block or wake up blocked threads (given that Apple is using atomics for semaphores, which is what I did for the semaphores in our OS).
Brent Priddy
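A semaphore of the kind described in that comment, which touches a blocking primitive only when a thread really has to sleep or be woken, can be sketched as follows. This is an assumption-laden illustration and not the libdispatch or Darwin implementation; light_sem_t and its functions are hypothetical names, and a mutex plus condition variable stands in for whatever kernel object a real implementation would use:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    atomic_long value;      /* > 0: free permits, < 0: number of waiters */
    pthread_mutex_t lock;   /* guards wakeups; used only under contention */
    pthread_cond_t cond;
    long wakeups;           /* wakeups granted but not yet consumed */
} light_sem_t;

int light_sem_init(light_sem_t *s, long initial)
{
    assert(s != NULL && initial >= 0);
    atomic_init(&s->value, initial);
    s->wakeups = 0;
    int rc = pthread_mutex_init(&s->lock, NULL);
    if (rc != 0)
        return rc;
    rc = pthread_cond_init(&s->cond, NULL);
    if (rc != 0)
        pthread_mutex_destroy(&s->lock);
    return rc;
}

void light_sem_signal(light_sem_t *s)
{
    /* Old value >= 0 means nobody is waiting: done with one atomic add. */
    if (atomic_fetch_add(&s->value, 1) >= 0)
        return;
    /* Slow path: hand exactly one wakeup to a sleeping waiter. */
    pthread_mutex_lock(&s->lock);
    s->wakeups++;
    pthread_cond_signal(&s->cond);
    pthread_mutex_unlock(&s->lock);
}

void light_sem_wait(light_sem_t *s)
{
    /* Old value > 0 means a permit was free: done with one atomic subtract. */
    if (atomic_fetch_sub(&s->value, 1) > 0)
        return;
    /* Slow path: sleep until a signaler grants a wakeup. */
    pthread_mutex_lock(&s->lock);
    while (s->wakeups == 0)
        pthread_cond_wait(&s->cond, &s->lock);
    s->wakeups--;
    pthread_mutex_unlock(&s->lock);
}
```

In an uncontended signal/wait loop like the "0 Contention" test, both calls stay on the atomic fast path. In the ping-pong test, almost every call takes the slow path, so the cost of the blocking primitive dominates.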
These may be using Mach semaphores on Darwin, either for interoperability or simply because they were already there…
Kaelin Colclasure