I'm trying to make a C++ API (for Linux and Solaris) thread-safe, so that its functions can be called from different threads without breaking internal data structures. In my current approach I'm using pthread mutexes to protect all accesses to member variables. This means that a simple getter function now locks and unlocks a mutex, and I'm worried about the overhead of this, especially as the API will mostly be used in single-threaded apps where any mutex locking seems like pure overhead.

So, I'd like to ask:

  • do you have any experience with performance of single-threaded apps that use locking versus those that don't?
  • how expensive are these lock/unlock calls compared to, e.g., a simple "return this->isActive" access for a bool member variable?
  • do you know better ways to protect such variable accesses?
+3  A: 

A mutex requires an OS context switch. That is fairly expensive. The CPU can still do it hundreds of thousands of times per second without too much trouble, but it is a lot more expensive than not having the mutex there. Putting it on every variable access is probably overkill.

It also probably is not what you want. This kind of brute-force locking tends to lead to deadlocks.

do you know better ways to protect such variable accesses?

Design your application so that as little data as possible is shared. Some sections of code should be synchronized, probably with a mutex, but only those that actually need it. And typically you should synchronize not individual variable accesses, but tasks containing groups of variable accesses that must be performed atomically. (Perhaps you need to set your is_active flag along with some other modifications; does it make sense to set that flag and make no further changes to the object?) A sketch of this is below.
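
As a rough illustration of that idea (the class and member names here are hypothetical, not from the asker's API): one lock covers the whole state transition, so no caller can ever observe the flag and its related state half-updated.

#include <pthread.h>

class Connection {
public:
    Connection() : isActive(true), pending(0) { pthread_mutex_init(&mutex, NULL); }
    ~Connection() { pthread_mutex_destroy(&mutex); }

    // one lock around the whole task, not around each member access
    void deactivate() {
        pthread_mutex_lock(&mutex);
        isActive = false;   // these two updates are seen together...
        pending = 0;        // ...or not at all, never half-done
        pthread_mutex_unlock(&mutex);
    }

private:
    pthread_mutex_t mutex;
    bool isActive;
    int pending;
};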

jalf
Yeah, that's a good point, no need to lock data that's not exposed. My strategy is almost always: (1) all data is private; (2) all public/protected methods that aren't static lock on the way in; (3) locks are objects on the stack so that uncaught exceptions unlock the data.
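
A minimal sketch of the stack-based lock object that comment describes (the name ScopedLock is illustrative):

#include <pthread.h>

class ScopedLock {
public:
    explicit ScopedLock(pthread_mutex_t& m) : mutex(m) { pthread_mutex_lock(&mutex); }
    ~ScopedLock() { pthread_mutex_unlock(&mutex); }  // runs even when an exception unwinds the stack
private:
    pthread_mutex_t& mutex;
    ScopedLock(const ScopedLock&);             // non-copyable (pre-C++11 style)
    ScopedLock& operator=(const ScopedLock&);
};
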
are you saying to lock in *all* public/protected methods? That's silly. Lock in the ones it makes sense to call from other threads, and nothing else.
jalf
The only way to get correct behavior in a multithreaded library is to expect users of the library to use it correctly. Spamming locks at everything that people *can* misuse doesn't solve the problem or ensure correctness. Let the user know what may, and what may not, be called from other threads, and then lock *those* methods only.
jalf
When you say "called from other threads", do you mean "called concurrently from multiple threads"? Restrictions on concurrency are a different thing from restrictions on context - an object with something to do with OpenGL might for example "not be callable from other threads" than the one to which the OpenGL context is bound. This doesn't mean it needs external synchronisation, it means you can't use it at all except from one particular thread.
Steve Jessop
@jalf I'm assuming that all public/protected methods on the interface are for use by the library client. If a method is for internal use, it should be hidden from the client. If a method isn't using class fields, it can be static and doesn't need a lock. There's no point in a client having access to methods it doesn't need or could misuse. It's simple enough to make those private, or implement them in a class that's not exposed to the client system and wouldn't need additional locking. It's easy to expect users to use the lib correctly if that's their only option. :-) Nobody reads the docs.
ugh, context switch. Don't use process-shared mutexes on Windows, they are hella slow. A context switch causes cache thrashing as well as burning CPU time. Use a critical section: it uses processor atomics to check the lock and only goes to the OS when there is lock contention. This is what Win32-Pthreads does unless you specifically ask for a process-shared mutex. So mutexes are usually fast, only a few instructions (one) in the common case. On Linux mutexes are always really fast, shared or not.
caspin
Actually... IIRC, on recent Linuxes (which is half of the question) pthread_mutex_lock does not require a context switch most of the time, as mutexes are implemented on top of futexes (see http://en.wikipedia.org/wiki/Futex ) by default.
Fredrik
Sorry - I'm coming from an RTOS background, not Linux, so the answer might be different... but if the mutex is available, will it still require a context switch? Under every RTOS I know, there won't be a context switch (actually 2) unless the mutex is currently unavailable.
Dan
+3  A: 

I wrote a similar library and didn't have any trouble with lock performance. (I can't tell you exactly how the locks are implemented, so I can't say conclusively that it's not a big deal.)

I'd go for getting it right first (i.e. use locks) then worry about performance. I don't know of a better way; that's what mutexes were built for.

An alternative for single thread clients would be to use the preprocessor to build a non-locked vs locked version of your library. E.g.:

#ifdef BUILD_SINGLE_THREAD
    // single-threaded build: locking compiles away entirely
    inline void lock () {}
    inline void unlock () {}
#else
    // multithreaded build: forward to the real mutex operations
    inline void lock () { doSomethingReal(); }
    inline void unlock () { doSomethingElseReal(); }
#endif

Of course, that adds an additional build to maintain, as you'd distribute both single and multithread versions.

using locks doesn't guarantee "getting it right". It might lead to deadlocks all over the place.
jalf
If the library in question doesn't call back into the user and doesn't lock anything else, then deadlocks are not a possibility.
caf
+1  A: 

I can tell you that on Windows, a mutex is a kernel object and as such incurs a (relatively) significant locking overhead. To get a better-performing lock when all you need is one that works between threads, use a critical section. This would not work across processes, just between the threads in a single process.

However, Linux is quite a different beast when it comes to multi-process locking. I know that a Linux mutex is implemented using atomic CPU instructions and only applies within a process - so it would have the same performance as a Win32 critical section, i.e. be very fast.

Of course, the fastest locking is not to have any at all, or to use locks as little as possible. But if your lib is to be used in a heavily threaded environment, you will want to hold each lock for as short a time as possible: lock, do something, unlock, do something else, then lock again is better than holding the lock across the whole task. The cost of locking isn't the time taken to acquire the lock, but the time a thread sits around twiddling its thumbs waiting for another thread to release a lock it wants! A sketch of that pattern follows.
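
A small sketch of the short-hold pattern (names are illustrative): copy the shared state out under the lock, then do the slow work with the lock released.

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
int sharedValue = 0;

void doExpensiveWork(int v) { /* slow computation stands in here */ (void)v; }

void process() {
    pthread_mutex_lock(&m);
    int snapshot = sharedValue;   // touch shared state only while locked
    pthread_mutex_unlock(&m);

    doExpensiveWork(snapshot);    // slow work runs with the lock released
}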

gbjbaanb
A: 

For member variable access, you could use read/write locks, which allow multiple concurrent readers without blocking each other and so cost less under read-heavy access.
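
For instance, a getter under a pthreads read/write lock might look like this (the class and member names are made up for illustration):

#include <pthread.h>

class Widget {
public:
    Widget() : isActiveFlag(false) { pthread_rwlock_init(&rwlock, NULL); }
    ~Widget() { pthread_rwlock_destroy(&rwlock); }

    bool isActive() const {
        pthread_rwlock_rdlock(&rwlock);   // many readers may hold this at once
        bool v = isActiveFlag;
        pthread_rwlock_unlock(&rwlock);
        return v;
    }

    void setActive(bool v) {
        pthread_rwlock_wrlock(&rwlock);   // writers get exclusive access
        isActiveFlag = v;
        pthread_rwlock_unlock(&rwlock);
    }

private:
    mutable pthread_rwlock_t rwlock;      // mutable so the const getter can lock it
    bool isActiveFlag;
};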

In many cases you can use atomic builtins, if your compiler provides them (with gcc or icc, __sync_fetch*() and the like), but they are notoriously hard to handle correctly.
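
As a small example of those builtins (available in gcc since 4.1; the variable names are illustrative):

// atomic read-modify-write with a full memory barrier
int counter = 0;

void increment() {
    __sync_fetch_and_add(&counter, 1);
}

// atomically: if *flag == expected, set it to newValue; returns the old value
int compareAndSwap(int* flag, int expected, int newValue) {
    return __sync_val_compare_and_swap(flag, expected, newValue);
}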

If you can guarantee that the access is atomic (for example, on x86 a dword read or write is always atomic if it is aligned, but a read-modify-write is not), you can often avoid locks altogether and use volatile instead, but this is non-portable and requires knowledge of the hardware.

drhirsch
You can never guarantee a read or write is atomic on a multicore system without processor help, i.e. __sync_fetch*. On a single-core machine it is safe to assume machine-word reads/writes are atomic, but single cores are going the way of the dodo.
caspin
@Caspin, _reading_ a dword is definitely an atomic operation in a sense that you're never going to get a mix of some "old" bits and some "new" bits even if another core is writing to the same memory location at the same time (but, for example, a qword is _not_ atomic on x86 in the same sense). I think you're confusing "atomic" with "ordered".
Pavel Minaev
@Caspin, you are wrong. Quoting from the Intel® 64 Architecture Memory Ordering White Paper: "Intel 64 memory ordering guarantees that for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access regardless of memory type: ... 3. Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4-byte boundary. 4. Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8-byte boundary." So, among lots of others, this is atomic. BTW, Intel 64 refers to both IA-32 and 64.
drhirsch
+1  A: 

This is a bit off-topic, but you seem to be new to threading - for one thing, only lock where threads can overlap. Then, try to minimize those places. Also, instead of trying to lock every method, think of what the thread is doing (overall) with an object, make that a single call, and lock that. Try to keep your locks as high up as possible (this again increases efficiency and may /help/ to avoid deadlocking). But locks don't 'compose'; you have to mentally cross-organize your code, at least, by where the threads are and where they overlap.

JDonner
Great job getting to the root of the problem. Encapsulate things as much as possible when they are touched by multiple threads. Accessors weaken encapsulation.
caspin
+8  A: 

All modern thread implementations can handle an uncontended mutex lock entirely in user space (with just a couple of machine instructions) - only when there is contention does the library have to call into the kernel.
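
To illustrate the shape of that fast path (a toy sketch, not how glibc actually implements pthread mutexes): the uncontended case is a single atomic instruction, and only the contended case would need the kernel (on Linux, via the futex syscall).

#include <sched.h>

class ToyLock {
public:
    ToyLock() : locked(0) {}
    void lock() {
        // fast path: one atomic exchange; succeeds immediately when uncontended
        while (__sync_lock_test_and_set(&locked, 1))
            sched_yield();   // slow-path stand-in; a real mutex would futex-wait
    }
    void unlock() {
        __sync_lock_release(&locked);   // store 0 with release semantics
    }
private:
    volatile int locked;
};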

Another point to consider is that if an application doesn't explicitly link to the pthread library (because it's a single-threaded application), it will only get dummy pthread functions (which don't do any locking at all) - only if the application is multi-threaded (and links to the pthread library) will the full pthread functions be used.

And finally, as others have already pointed out, there is no point in protecting a getter method for something like isActive with a mutex - by the time the caller gets a chance to look at the return value, the value might already have changed (as the mutex is only locked inside the getter method).

cmeerw
A: 

Well, a suboptimal but simple approach is to place macros around your mutex locks and unlocks, and then have a compiler/makefile option to enable or disable threading.

Ex.

#ifdef THREAD_ENABLED
/* use the real pthread_mutex_lock/unlock from <pthread.h> unchanged */
#else
#define pthread_mutex_lock(x)   0   /* compiles to a no-op */
#define pthread_mutex_unlock(x) 0
#endif

Then, when compiling, pass -DTHREAD_ENABLED to gcc to enable threading.

Again, I would NOT use this method in any large project, but only if you want something fairly simple.

James
+1  A: 

I was curious about the expense of pthread_mutex_lock/unlock. I had a scenario where I needed to either copy anywhere from 1500-65K bytes without using a mutex, or use a mutex and do a single write of a pointer to the data needed.

I wrote a short loop to time each approach:

gettimeofday(&starttime, NULL);
// COPY DATA
gettimeofday(&endtime, NULL);
timersub(&endtime, &starttime, &timediff);
// print out timediff data

or

gettimeofday(&starttime, NULL);
pthread_mutex_lock(&mutex);
gettimeofday(&endtime, NULL);
pthread_mutex_unlock(&mutex);
timersub(&endtime, &starttime, &timediff);
// print out timediff data

If I was copying less than 4000 or so bytes, then the straight copy operation took less time. If, however, I was copying more than 4000 bytes, then it was less costly to do the mutex lock/unlock.

The mutex lock/unlock ran between 3 and 5 usec, including the time for the gettimeofday call for the current time, which took about 2 usec.
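
For reference, a self-contained version of the lock-timing half of that experiment might look like this (the single-iteration timing is illustrative; a real measurement would loop and average):

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

int main() {
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    struct timeval starttime, endtime, timediff;

    gettimeofday(&starttime, NULL);
    pthread_mutex_lock(&mutex);
    gettimeofday(&endtime, NULL);
    pthread_mutex_unlock(&mutex);

    timersub(&endtime, &starttime, &timediff);
    printf("lock took %ld.%06ld seconds\n",
           (long) timediff.tv_sec, (long) timediff.tv_usec);
    return 0;
}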