Hello *,

I am currently writing a scientific article in which I need to be very exact with citations. Can someone point me to an MSDN page or article, some other published article, or a book where I can find a performance comparison of Windows or .NET synchronization primitives?

I know that these are, in descending order of performance: Interlocked API, Critical Section, .NET lock-statement, Monitor, Mutex, EventWaitHandle, Semaphore.

Many Thanks,
Ovanes

P.S. I found a great book: Concurrent Programming on Windows by Joe Duffy. It is written by one of the head concurrency developers for the .NET Framework and is simply brilliant, with many explanations of how things work and how they were implemented.

+2  A: 

I doubt you'll find direct numbers on these - they vary based on the underlying OS and CPU, as well as in different situations.

It's odd to compare the performance of these primitives since they do different things: an EventWaitHandle behaves differently from a critical section, so you can't directly compare their performance. You'll also find that they perform differently in different situations. A critical section is faster than a mutex for an uncontended acquire, but the two will perform similarly in the face of contention. Some of these primitives may perform horribly under heavy contention, where others will scale much better.

I recommend creating a test program to measure the performance yourself. It should not take long to write one that measures each of these primitives, and you'll be able to answer any questions about the numbers in your paper.
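
As a rough illustration of what such a test program could look like, here is a minimal sketch of an uncontended micro-benchmark. It is written in Python only to keep it short and portable; the threading.Lock, threading.RLock and threading.Semaphore objects are stand-ins, not the Windows or .NET primitives from the question, and the helper names (time_uncontended, run_benchmark) are made up for this example. The numbers it prints include loop and interpreter overhead, so they are only meaningful relative to each other on the same machine.

```python
import threading
import time


def time_uncontended(make_primitive, iterations=100_000):
    """Average nanoseconds per acquire/release pair on a single thread.

    No other thread touches the primitive, so this measures only the
    uncontended path (plus the cost of the Python loop itself).
    """
    primitive = make_primitive()
    start = time.perf_counter_ns()
    for _ in range(iterations):
        primitive.acquire()
        primitive.release()
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


def run_benchmark(iterations=100_000):
    """Time each stand-in primitive and return {name: ns per pair}."""
    candidates = {
        "Lock": threading.Lock,
        "RLock": threading.RLock,
        "Semaphore": lambda: threading.Semaphore(1),
    }
    return {
        name: time_uncontended(factory, iterations)
        for name, factory in candidates.items()
    }


if __name__ == "__main__":
    for name, ns in run_benchmark().items():
        print(f"{name:10s} {ns:8.1f} ns per acquire/release")
```

The same structure carries over to the real primitives: create one instance, time a tight acquire/release loop, and report the average. For a paper you would also want to repeat each run several times and report the spread, not a single figure.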

Michael
+2  A: 

The behaviour is:

  1. Not a simple descending list since some do more work than others.
  2. Varies in cost depending on the CPU architecture you are running on, the number of cores in the system, and the version of Windows.

Some notes:

  • The lock statement is syntactic sugar for the Monitor class.
  • Many of these are incredibly thin wrappers around the underlying Win32 API calls, often invoked directly through P/Invoke. Some of those calls are in turn thin wrappers around a few CPU instructions.

The lower-level the primitive, the more its cost depends on the underlying hardware. For example, cache locking and invalidation between CPUs within the same package/NUMA node can be much faster than on older FSB-style SMP systems.

ShuggyCoUk
+1  A: 

Finding specific numbers is difficult, and I would strongly encourage you to test the locks in your own scenario, because the performance will depend on access ratios, contention patterns, and the hardware the code runs on. I also encourage you to include the new lightweight primitives in .NET 4.0 in your comparison, such as System.Threading.SpinLock and System.Threading.SemaphoreSlim.

That being said, Joe Duffy has several posts on his blog that compare the performance of particular locks, for example this one.
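
To show how contention patterns can be brought into such a test, here is a minimal sketch that times a shared counter protected by one lock while the number of competing threads varies. It is an illustration in Python, not a .NET measurement: threading.Lock stands in for the lock under test, the function name time_contended is hypothetical, and Python's global interpreter lock means the absolute numbers say nothing about Windows or .NET primitives. Only the shape of the experiment is meant to carry over.

```python
import threading
import time


def time_contended(num_threads, increments_per_thread=20_000):
    """Run num_threads workers that all increment one counter under one lock.

    Returns (elapsed seconds, final counter value).
    """
    lock = threading.Lock()
    counter = 0

    def worker():
        nonlocal counter
        for _ in range(increments_per_thread):
            with lock:  # every increment competes for the same lock
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, counter


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        elapsed, total = time_contended(n)
        print(f"{n} threads: {elapsed / total * 1e9:.0f} ns per locked increment")
```

Varying the thread count, and the amount of work done inside versus outside the lock, is one way to approximate the access ratios of a real workload.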

Rick
+1  A: 

For a rough comparison, the following numbers from Lockless Programming Considerations for Xbox 360 and Microsoft Windows may come in handy.


The performance of synchronization instructions and functions on Windows vary widely depending on the processor type and configuration, and on what other code is running. Multi-core and multi-socket systems often take longer to execute synchronizing instructions, and acquiring locks take much longer if another thread currently owns the lock.

However, even some measurements generated from very simple tests are helpful:

  • MemoryBarrier was measured as taking 20-90 cycles.
  • InterlockedIncrement was measured as taking 36-90 cycles.
  • Acquiring or releasing a critical section was measured as taking 40-100 cycles.
  • Acquiring or releasing a mutex was measured as taking about 750-2500 cycles.

These tests were done on Windows XP on a range of different processors. The short times were on a single-processor machine, and the longer times were on a multi-processor machine.
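
Cycle counts are easier to put next to wall-clock measurements once they are converted to time. The sketch below does that arithmetic for the ranges quoted above. The clock frequency is an assumption you have to supply yourself; the article does not state the clock speeds of the machines tested, and the 3.0 GHz used in the usage line is purely illustrative.

```python
# Cycle ranges (low, high) as quoted from the article above.
MEASURED_CYCLES = {
    "MemoryBarrier": (20, 90),
    "InterlockedIncrement": (36, 90),
    "Critical section acquire/release": (40, 100),
    "Mutex acquire/release": (750, 2500),
}


def cycles_to_ns(cycles, clock_ghz):
    """Convert a cycle count to nanoseconds at the given clock frequency.

    One GHz is one cycle per nanosecond, so ns = cycles / GHz.
    """
    return cycles / clock_ghz


def to_nanoseconds(clock_ghz):
    """Return {operation: (low ns, high ns)} for an assumed clock frequency."""
    return {
        name: (cycles_to_ns(low, clock_ghz), cycles_to_ns(high, clock_ghz))
        for name, (low, high) in MEASURED_CYCLES.items()
    }


if __name__ == "__main__":
    assumed_clock_ghz = 3.0  # hypothetical; not taken from the article
    for name, (low, high) in to_nanoseconds(assumed_clock_ghz).items():
        print(f"{name}: {low:.1f} to {high:.1f} ns")
```

Whatever frequency is assumed, the ratio between the rows stays the same, which is the part of these figures that is safest to cite.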

Suma
Thanks a lot for pointing this out. A very interesting article with lots of insights!
ovanes