I've encountered a strange behavior in a .NET application that performs some highly parallel processing on a set of in-memory data.

When run on a multi-core processor (Intel Core 2 Quad Q6600, 2.4 GHz) it exhibits non-linear scaling as multiple threads are kicked off to process the data.

When run as a non-multithreaded loop on a single core, the process completes approximately 2.4 million computations per second. When run as four threads you would expect four times the throughput - somewhere in the neighborhood of 9.6 million computations per second - but alas, no. In practice it only completes about 4.1 million per second ... quite a bit short of the expected throughput.

Furthermore, the behavior occurs no matter whether I use PLINQ, a thread pool, or four explicitly created threads. Quite strange...

Nothing else is running on the machine using CPU time, nor are there any locks or other synchronization objects involved in the computation ... it should just tear ahead through the data. I've confirmed this (to the extent possible) by looking at perfmon data while the process runs ... and there are no reported thread contentions or garbage collection activity.

My theories at the moment:

  1. The overhead of all of the techniques (thread context switches, etc.) is overwhelming the computations
  2. The threads are not getting assigned to each of the four cores and spend some time waiting on the same processor core ... I'm not sure how to test this theory...
  3. .NET CLR threads are not running at the priority expected or have some hidden internal overhead.

Below is a representative excerpt from the code that should exhibit the same behavior:

    var evaluator = new LookupBasedEvaluator();

    // find all ten-vertex polygons that are a subset of the set of points
    var ssg = new SubsetGenerator<PolygonData>(Points.All, 10);

    const int TEST_SIZE = 10000000;  // evaluate the first 10 million records

    // materialize the data into memory...
    var polygons = ssg.AsParallel()
                      .Take(TEST_SIZE)
                      .Cast<PolygonData>()
                      .ToArray();

    var sw1 = Stopwatch.StartNew();
    // for loop completes in about 4.02 seconds... ~ 2.483 million/sec
    foreach( var polygon in polygons )
        evaluator.Evaluate(polygon);
    sw1.Stop();
    Console.WriteLine( "Linear, single core loop: {0}", sw1.ElapsedMilliseconds );

    // now attempt the same thing in parallel using Parallel.ForEach...
    // MS documentation indicates this internally uses a worker thread pool
    // completes in 2.61 seconds ... or ~ 3.831 million/sec
    var sw2 = Stopwatch.StartNew();
    Parallel.ForEach(polygons, p => evaluator.Evaluate(p));
    sw2.Stop();
    Console.WriteLine( "Parallel.ForEach() loop: {0}", s2.ElapsedMilliseconds );

    // now using PLINQ, we get slightly better results, but not by much
    // completes in 2.21 seconds ... or ~ 4.524 million/second
    var sw3 = Stopwatch.StartNew();
    polygons.AsParallel()
            .WithDegreeOfParallelism(Environment.ProcessorCount)
            .AsUnordered() // not sure this is necessary...
            .ForAll( h => evaluator.Evaluate(h) );
    sw3.Stop();
    Console.WriteLine( "PLINQ.AsParallel.ForAll: {0}", sw3.ElapsedMilliseconds );

    // now using four explicit threads:
    // best, still short of expectations at 1.99 seconds = ~ 5 million/sec
    // NOTE: 'hands' and Card[] are identifiers carried over from the original
    // project this excerpt was taken from; conceptually 'hands' is the same
    // shared data source iterated by the approaches above.
    ParameterizedThreadStart tsd = delegate(object pset)
    {
        foreach (var p in (IEnumerable<Card[]>) pset)
            evaluator.Evaluate(p);
    };
    var t1 = new Thread(tsd);
    var t2 = new Thread(tsd);
    var t3 = new Thread(tsd);
    var t4 = new Thread(tsd);

    var sw4 = Stopwatch.StartNew();
    t1.Start(hands);
    t2.Start(hands);
    t3.Start(hands);
    t4.Start(hands);
    t1.Join();
    t2.Join();
    t3.Join();
    t4.Join();
    sw4.Stop();
    Console.WriteLine( "Four Explicit Threads: {0}", sw4.ElapsedMilliseconds );
A: 

I certainly would not expect a linear relationship, but I would have thought you would have seen a bigger gain than that. I am assuming that the CPU usage is maxed out on all cores. Just a couple of thoughts off the top of my head.

  • Are you using any shared data structures (either explicitly or implicitly) that require synchronization?
  • Have you tried profiling or recording performance counters to determine where the bottleneck is? Can you give any more clues?

Edit: Sorry, I just noticed that you have already addressed both of my points.

Brian Gideon
Indeed... For instance could you be maxing your memory throughput?
spender
An interesting idea, @spender - what performance counters could I examine to determine whether I am indeed maxing out memory throughput?
LBushkin
If you're maxing your memory throughput, the easiest way to tell is that CPU usage would not be at 100%...
configurator
+3  A: 

Non-linear scaling is to be expected of a parallel algorithm in comparison with a sequential one, since there is some inherent overhead in the parallelization. (Ideally, of course, you want to get as close to linear as you can.)

Additionally, there will usually be certain things you need to take care of in a parallel algorithm that you don't need to in a sequential algorithm. Beyond synchronization (which can really bog down your work), there are some other things that can happen:

  • The CPU and OS can't devote all of their time to your application. The OS needs to do context switching every now and again to let other processes get some work done. When you're only using a single core, your process is less likely to be switched out, because the OS has three other cores to choose from for that other work. Note that even though you might think nothing else is running, the OS or some services could still be performing background work.
  • If each of your threads is accessing a lot of data, and this data is not common between threads, you will most likely not be able to store all of this in the CPU cache. That means a lot more memory accessing is required, which is (relatively) slow.

As far as I can tell, your current explicit approach uses a shared iterator between the threads. That's an okay solution if the processing time varies wildly throughout the array, but there is likely to be synchronization overhead to prevent an element from being skipped or processed twice (retrieving the current element and moving the internal pointer to the next element needs to be one atomic operation).

Therefore, it might be a better idea to partition the array, assuming the processing time of each element is expected to be roughly equal regardless of the position of the element. Given that you have 10 million records, that means telling thread 1 to work on elements 0 to 2,499,999, thread 2 works on elements 2,500,000 to 4,999,999, etc. You can assign each thread an ID and use this to calculate the actual range.

Another small improvement would be to let the main thread act as one of the threads that calculates. However, if I remember correctly, that's a very minor thing.
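
A minimal sketch of that range-partitioning idea, assuming the `polygons` array and `evaluator` from the question (the thread count, chunk arithmetic, and running the last slice on the main thread are illustrative, not from the original post):

    // Partition the array into contiguous slices, one per core.
    int threadCount = Environment.ProcessorCount;
    int chunkSize = (polygons.Length + threadCount - 1) / threadCount;
    var threads = new Thread[threadCount - 1];

    for (int i = 0; i < threads.Length; i++)
    {
        int start = i * chunkSize;
        int end = Math.Min(start + chunkSize, polygons.Length);
        threads[i] = new Thread(() =>
        {
            // each thread walks only its own slice; no shared iterator needed
            for (int j = start; j < end; j++)
                evaluator.Evaluate(polygons[j]);
        });
        threads[i].Start();
    }

    // Let the main thread process the final slice instead of just waiting.
    for (int j = (threadCount - 1) * chunkSize; j < polygons.Length; j++)
        evaluator.Evaluate(polygons[j]);

    foreach (var t in threads)
        t.Join();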

Michael Madsen
+5  A: 

Take a look at this article: http://blogs.msdn.com/pfxteam/archive/2008/08/12/8849984.aspx

Specifically, limit memory allocations in the parallel region, and carefully inspect writes to make sure that they don't occur close to memory locations that other threads read or write (false sharing).
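
As one hedged illustration of the "limit allocations in the parallel region" advice (not taken from the linked article), `Parallel.ForEach` has a thread-local-state overload that lets each worker allocate one scratch buffer up front and reuse it for every item; the two-argument `Evaluate` overload below is hypothetical:

    // Each worker gets a single buffer via localInit and reuses it, so no
    // per-item allocations (and no allocator synchronization) occur in the loop.
    Parallel.ForEach(
        polygons,
        () => new long[100],               // localInit: one buffer per worker
        (p, loopState, buffer) =>
        {
            evaluator.Evaluate(p, buffer); // hypothetical overload taking a scratch buffer
            return buffer;                 // hand the buffer back for the next item
        },
        buffer => { });                    // localFinally: nothing to clean up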

Igor ostrovsky
You should also be able to profile this and see what's going on in the concurrency views of the VS 2010 profiler. Here is their team's blog: http://blogs.msdn.com/visualizeparallel/default.aspx
Rick
+2  A: 

So I finally figured out what the problem was - and I think it would be useful to share it with the SO community.

The entire issue with non-linear performance was the result of a single line inside the Evaluate() method:

    var coordMatrix = new long[100];

Since Evaluate() is invoked millions of times, this memory allocation was occurring millions of times. As it happens, the CLR performs some internal inter-thread synchronization when allocating memory - otherwise allocations on multiple threads could inadvertently overlap. Changing the array from a method-local variable to a class-level instance that is allocated only once (but then re-initialized by a loop inside the method) eliminated the scalability problem.

Normally, it's an antipattern to create a class-level member for a variable that is only used (and meaningful) within the scope of a single method. But in this case, since I need the greatest possible scalability, I will live with (and document) this optimization.
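
A minimal before/after sketch of the change described above; only the allocation line was shown in the original, so the surrounding class, the Evaluate signature, and the return type are assumptions:

    class LookupBasedEvaluator
    {
        // Before: allocated on every call, millions of times, which forced the
        // CLR to synchronize allocations across threads:
        //     var coordMatrix = new long[100];   // inside Evaluate()

        // After: allocated once per evaluator instance, so the hot path
        // performs no allocations at all.
        private readonly long[] coordMatrix = new long[100];

        public int Evaluate(PolygonData polygon)
        {
            // method-local re-initialization loop replaces the allocation
            for (int i = 0; i < coordMatrix.Length; i++)
                coordMatrix[i] = 0;

            // ... evaluation logic from the original method ...
            return 0; // placeholder result
        }
    }

    // NOTE: with an instance-level buffer the evaluator is no longer safe to
    // share between threads; each thread needs its own evaluator instance (or
    // some other per-thread isolation of the buffer).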

Epilogue: After I made this change, the concurrent process was able to achieve 12.2 million computations / sec.

P.S. Kudos to Igor Ostrovsky for his germane link to MSDN blogs which helped me identify and diagnose the problem.

LBushkin
You could use stack allocation, which avoids the problem of allocation altogether, although your way works just fine :)
Martin
This could have also been solved by using a resource pool. This question helped me understand why pools can be important when attempting to perform massive parallel operations.
Will
So your original non-parallel implementation runs at 2.4 Mop/s, and your latest parallelised version on 4 cores runs at 12.2 Mop/s. That's super-linear scaling, which is remarkable and worthy of investigation. You did retest the single-core execution of the code after you made the change, didn't you?
High Performance Mark
Changing the memory allocation improved the single-core performance to 3.2 million/sec, so the 12.2 million/sec four-core result is reasonable.
LBushkin
A: 

I've asked a similar question here titled "Why doesn’t my threaded .Net app scale linearly when allocating large amounts of memory?"

http://stackoverflow.com/questions/2072752/why-doesnt-my-threaded-net-app-scale-linearly-when-allocating-large-amounts-of

Luke Venediger