Hi,

In a low-level language (C, C++ or whatever), I have the choice between having a bunch of mutexes (like what pthread gives me, or whatever the native system library provides) or a single one for an object.

How efficient is it to lock a mutex? I.e. roughly how many assembler instructions are involved, and how much time do they take (in the case where the mutex is unlocked)?

How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as many mutex variables into my code as I have int variables, and it doesn't really matter?

(I am not sure how much difference there is between different hardware. If there is, I would also like to know about it. But mostly, I am interested in common hardware.)

The point is: by using many mutexes which each cover only a part of the object, instead of a single mutex for the whole object, I could avoid a lot of blocking. And I am wondering how far I should go with this. I.e. should I try to avoid every possible block as far as possible, no matter how much more complicated it gets and how many more mutexes this means?
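
For concreteness, here is a minimal sketch of the two variants I mean (pthreads; the struct and field names are just made up for illustration):

    #include <pthread.h>

    /* Variant A: one mutex guards the whole object. */
    struct obj_coarse {
        pthread_mutex_t lock;        /* taken for any access to any field */
        long field_a;
        long field_b;
    };

    /* Variant B: one mutex per independently used part of the object. */
    struct obj_fine {
        pthread_mutex_t lock_a;
        long field_a;
        pthread_mutex_t lock_b;
        long field_b;
    };

    /* With variant B, a thread touching only field_a never blocks a
       thread touching only field_b. */
    void bump_a(struct obj_fine *o) {
        pthread_mutex_lock(&o->lock_a);
        o->field_a++;
        pthread_mutex_unlock(&o->lock_a);
    }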

Thanks, Albert

+1  A: 

The cost will vary depending on the implementation but you should keep in mind two things:

  • the cost will most likely be minimal, since it's a fairly primitive operation and it will be optimised as much as possible due to its usage pattern (it is used a lot).
  • it doesn't matter how expensive it is since you need to use it if you want safe multi-threaded operation. If you need it, then you need it.

On single processor systems, you can generally just disable interrupts long enough to atomically change data. Multi-processor systems can use a test-and-set strategy.

In both those cases, the instructions are relatively efficient.
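
For illustration only, the core of a test-and-set lock can be sketched with C11 atomics in a handful of lines (a real mutex also sleeps on contention instead of spinning, but the uncontended path is about this small):

    #include <stdatomic.h>

    static atomic_flag busy = ATOMIC_FLAG_INIT;   /* clear = unlocked */

    void spin_lock(void) {
        /* atomic_flag_test_and_set typically compiles down to a single
           bus-locked instruction (e.g. a LOCK-prefixed exchange on x86);
           in the uncontended case the loop body is never entered. */
        while (atomic_flag_test_and_set_explicit(&busy, memory_order_acquire))
            ;   /* spin; a real mutex would go to sleep here instead */
    }

    void spin_unlock(void) {
        atomic_flag_clear_explicit(&busy, memory_order_release);
    }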

As to whether you should provide a single mutex for a massive data structure, or have many mutexes, one for each section of it, that's a balancing act.

By having a single mutex, you have a higher risk of contention between multiple threads. You can reduce this risk by having a mutex per section but you don't want to get into a situation where a thread has to lock 180 mutexes to do its job :-)

paxdiablo
Yeah, but *how* efficient? Is it a single machine instruction? Or about 10? Or about 100? 1000? More? All of these are still efficient, but it can make a difference in extreme situations.
Albert
Well, that depends _entirely_ on the implementation. You can turn off interrupts, test/set an integer and reactivate interrupts in a loop in about six machine instructions. Test-and-set can be done in about as many since the processors tend to provide that as a single instruction.
paxdiablo
A bus-locked test-and-set is a single (rather long) instruction on x86. The rest of the machinery to use it is pretty quick (“did the test succeed?” is a question that CPUs are good at doing fast) but it's the bus-locked instruction's length that really matters as it is the part that blocks things. Solutions with interrupts are much slower, because manipulating them is typically restricted to the OS kernel to stop trivial DoS attacks.
Donal Fellows
BTW, don't use drop/reacquire as a means for having a thread yield to others; that's a strategy that sucks on a multicore system. (It's one of the relatively-few things that CPython gets wrong.)
Donal Fellows
@Donal: What do you mean by drop/reacquire? That sounds important; can you give me more information on that?
Albert
@paxdiablo: Btw, yes, I need a mutex, but I have the choice between using about a hundred (or more) of them or a single one instead.
Albert
@Albert: No. You'll only try to go and implement it and your code will suck and you won't realize it. Why not do the sensible thing and stop worrying about this low-level thing? Just use the correct concurrency patterns that already exist and save yourself a whole bunch of grief. And stop worrying over how locks are implemented; worry over how they perform in practice *as measured by testing*.
Donal Fellows
@Donal: I didn't mean that I want to use it. I just want to know what you mean by it, so I can make sure that I am not using it and understand why it is a bad idea. I was basically asking for references that give some background/details about it.
Albert
@Albert: Oh that. David Beazley has that covered in great depth. http://www.dabeaz.com/GIL/ and http://blip.tv/file/2232410 should be relevant.
Donal Fellows
+2  A: 

This depends on what you actually call a "mutex", the OS mode, etc.

At minimum it's the cost of an interlocked memory operation. That is a relatively heavy operation (compared to other primitive assembler instructions).

However, the cost can be very much higher. If what you call a "mutex" is a kernel object (i.e. an object managed by the OS) and you run in user mode, every operation on it leads to a kernel-mode transition, which is very heavy.

For example, on an Intel Core Duo processor under Windows XP, an interlocked operation takes about 40 CPU cycles, whereas a kernel-mode call (i.e. a system call) takes about 2000 CPU cycles.

If this is the case, you may consider using critical sections. They are a hybrid of a kernel mutex and interlocked memory access.
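
For illustration, the Win32 usage is roughly this (the surrounding function names are just placeholders):

    #include <windows.h>

    CRITICAL_SECTION cs;                /* lives in your process's memory */

    void setup(void)    { InitializeCriticalSection(&cs); }

    void touch_shared(void) {
        EnterCriticalSection(&cs);      /* uncontended case: interlocked op
                                           only, no kernel-mode transition */
        /* ... access the shared data ... */
        LeaveCriticalSection(&cs);
    }

    void teardown(void) { DeleteCriticalSection(&cs); }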

valdo
Windows critical sections are much closer to mutexes. They have regular mutex semantics, but they are process-local. The last part makes them a lot faster, as they can be handled entirely within your process (and thus user-mode code).
MSalters
+1  A: 

I have the choice between having a bunch of mutexes or a single one for an object.

If you have many threads and access to the object happens often, then multiple locks will increase parallelism, at the cost of maintainability, since more locking means more debugging of the locking.

How efficient is it to lock a mutex? I.e. roughly how many assembler instructions are involved, and how much time do they take (in the case where the mutex is unlocked)?

The precise assembler instructions are the least of a mutex's overhead; the memory/cache coherency guarantees are the main one. And the less often a particular lock is taken, the better.

A mutex is made of two major parts (oversimplifying): (1) a flag indicating whether the mutex is locked or not and (2) a wait queue.

Changing the flag takes just a few instructions and is normally done without a system call. If the mutex is already locked, a syscall happens to add the calling thread to the wait queue and start the wait. Unlocking, if the wait queue is empty, is cheap, but otherwise it needs a syscall to wake up one of the waiting threads. (On some systems, cheap/fast syscalls are used to implement the mutexes; they become slow (normal) system calls only in case of contention.)

Locking an unlocked mutex is really cheap. Unlocking a mutex without contention is cheap too.
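
To make that concrete, here is a toy sketch of such a lock in the style of a Linux futex-based mutex (loosely following Ulrich Drepper's "Futexes Are Tricky"; the names are made up and this is an illustration, not production code):

    #include <stdatomic.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/futex.h>

    /* state: 0 = unlocked, 1 = locked, 2 = locked and (possibly) contended */
    typedef struct { atomic_int state; } toy_mutex;

    static void futex_wait(atomic_int *addr, int val) {
        /* sleeps only while *addr still equals val; the kernel rechecks */
        syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
    }
    static void futex_wake_one(atomic_int *addr) {
        syscall(SYS_futex, addr, FUTEX_WAKE, 1, NULL, NULL, 0);
    }

    void toy_mutex_lock(toy_mutex *m) {
        int c = 0;
        /* fast path: one interlocked compare-and-swap, no system call */
        if (atomic_compare_exchange_strong(&m->state, &c, 1))
            return;
        /* slow path: advertise contention (state 2) and sleep in the kernel */
        do {
            int prev = 1;
            if (c == 2 ||
                atomic_compare_exchange_strong(&m->state, &prev, 2) || prev != 0)
                futex_wait(&m->state, 2);
            c = 0;
        } while (!atomic_compare_exchange_strong(&m->state, &c, 2));
    }

    void toy_mutex_unlock(toy_mutex *m) {
        /* fast path: nobody waiting (state was 1) -> one atomic op, no syscall */
        if (atomic_exchange(&m->state, 0) == 2)
            futex_wake_one(&m->state);   /* wake one sleeper via syscall */
    }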

How much does a mutex cost? Is it a problem to have really a lot of mutexes? Or can I just throw as many mutex variables into my code as I have int variables, and it doesn't really matter?

You can throw as many mutex variables into your code as you wish. You are only limited by the amount of memory your application can allocate.

Summary: user-space locks (and mutexes in particular) are cheap and not subject to any system limit. But too many of them spell a nightmare for debugging. Simple table:

  1. Fewer locks mean more contention (slow syscalls, CPU stalls, see (*) below) and less parallelism.
  2. Fewer locks mean fewer problems when debugging multi-threading issues.
  3. More locks mean less contention and higher parallelism.
  4. More locks mean more chances of running into undebuggable deadlocks.

A balanced locking scheme for the application should be found and maintained, generally balancing #2 and #3.


(*) The problem with having fewer, very often locked mutexes is that if you have too much locking in your application, it causes too much inter-CPU/core traffic to flush the mutex memory from the data caches of other CPUs, in order to guarantee cache coherency. The cache flushes are like light-weight interrupts and are handled by the CPUs transparently, but they do introduce so-called stalls (search for "stall").

And the stalls are what makes the locking code run slowly, often without any apparent indication of why the application is slow. (Some architectures provide inter-CPU/core traffic stats, some do not.)

To avoid the problem, people generally resort to a large number of locks to decrease the probability of lock contention and to avoid the stalls. That is the reason why cheap user-space locking, not subject to system limits, exists.

Dummy00001
Thanks, that mostly answers my question. I didn't know that the kernel (e.g. the Linux kernel) handles mutexes and that you control them via syscalls. But as Linux itself manages the scheduling and context switches, this makes sense. Now I have a rough idea of what the mutex lock/unlock does internally.
Albert
@Albert: Oh, I forgot the context switches... Context switches are also a drain on performance. If lock acquisition *fails* and the thread has to wait, that is in itself sort of half a context switch. The context switch itself is fast, but since the CPU might be used by some other process in the meantime, the caches would be filled with alien data. After the thread finally acquires the lock, chances are the CPU will have to reload pretty much everything from RAM anew.
Dummy00001