I've encountered a strange behavior in a .NET application that performs some highly parallel processing on a set of in-memory data.

When run on a multi-core processor (Intel Core 2 Quad Q6600, 2.4 GHz) it exhibits non-linear scaling as multiple threads are kicked off to process the data.

When run as a non-multithreaded loop on a single core, the process completes approximately 2.4 million computations per second. When run as four threads you would expect four times the throughput - somewhere in the neighborhood of 9.6 million computations per second - but alas, no. In practice it only completes about 4.1 million per second ... quite a bit short of the expected throughput.

Furthermore, the behavior occurs no matter whether I use PLINQ, a thread pool, or four explicitly created threads. Quite strange...

Nothing else is running on the machine using CPU time, nor are there any locks or other synchronization objects involved in the computation ... it should just tear ahead through the data. I've confirmed this (to the extent possible) by looking at perfmon data while the process runs ... and there are no reported thread contentions or garbage collection activity.

My theories at the moment:

  1. The overhead of all of the techniques (thread context switches, etc.) is overwhelming the computations
  2. The threads are not getting assigned to each of the four cores and spend some time waiting on the same processor core ... I'm not sure how to test this theory...
  3. .NET CLR threads are not running at the priority expected or have some hidden internal overhead.

Below is a representative excerpt from the code that should exhibit the same behavior:

    var evaluator = new LookupBasedEvaluator();

    // find all ten-vertex polygons that are a subset of the set of points
    var ssg = new SubsetGenerator<PolygonData>(Points.All, 10);

    const int TEST_SIZE = 10000000;  // evaluate the first 10 million records

    // materialize the data into memory...
    var polygons = ssg.AsParallel()
                      .Take(TEST_SIZE)
                      .Cast<PolygonData>()
                      .ToArray();

    var sw1 = Stopwatch.StartNew();
    // for loop completes in about 4.02 seconds... ~ 2.483 million/sec
    foreach( var polygon in polygons )
        evaluator.Evaluate(polygon);
    sw1.Stop();
    Console.WriteLine( "Linear, single core loop: {0}", sw1.ElapsedMilliseconds );

    // now attempt the same thing in parallel using Parallel.ForEach...
    // MS documentation indicates this internally uses a worker thread pool
    // completes in 2.61 seconds ... or ~ 3.831 million/sec
    var sw2 = Stopwatch.StartNew();
    Parallel.ForEach(polygons, p => evaluator.Evaluate(p));
    sw2.Stop();
    Console.WriteLine( "Parallel.ForEach() loop: {0}", s2.ElapsedMilliseconds );

    // now using PLINQ, we get slightly better results, but not by much
    // completes in 2.21 seconds ... or ~ 4.524 million/second
    var sw3 = Stopwatch.StartNew();
    polygons.AsParallel()
            .WithDegreeOfParallelism(Environment.ProcessorCount)
            .AsUnordered() // not sure this is necessary...
            .ForAll( h => evaluator.Evaluate(h) );
    sw3.Stop();
    Console.WriteLine( "PLINQ.AsParallel.ForAll: {0}", sw3.ElapsedMilliseconds );

    // now using four explicit threads:
    // best, still short of expectations at 1.99 seconds = ~ 5 million/sec
    // NOTE: 'hands' and Card[] are identifiers carried over from the original
    // project this excerpt was taken from; conceptually 'hands' is the same
    // shared data source iterated by the approaches above.
    ParameterizedThreadStart tsd = delegate(object pset)
    {
        foreach (var p in (IEnumerable<Card[]>) pset)
            evaluator.Evaluate(p);
    };
    var t1 = new Thread(tsd);
    var t2 = new Thread(tsd);
    var t3 = new Thread(tsd);
    var t4 = new Thread(tsd);

    var sw4 = Stopwatch.StartNew();
    t1.Start(hands);
    t2.Start(hands);
    t3.Start(hands);
    t4.Start(hands);
    t1.Join();
    t2.Join();
    t3.Join();
    t4.Join();
    sw4.Stop();
    Console.WriteLine( "Four Explicit Threads: {0}", sw4.ElapsedMilliseconds );
A: 

I certainly would not expect a linear relationship, but I would have thought you would have seen a bigger gain than that. I am assuming that the CPU usage is maxed out on all cores. Just a couple of thoughts off the top of my head.

  • Are you using any shared data structures (either explicitly or implicitly) that require synchronization?
  • Have you tried profiling or recording performance counters to determine where the bottleneck is? Can you give any more clues?

Edit: Sorry, I just noticed that you have already addressed both of my points.

Brian Gideon
Indeed... For instance could you be maxing your memory throughput?
spender
An interesting idea, @spender - what performance counters could I examine to determine whether I am indeed maxing out memory throughput?
LBushkin
If you're maxing your memory throughput, the easiest way to tell is that CPU usage would not be at 100%...
configurator
+3  A: 

Non-linear scaling is to be expected of a parallel algorithm in comparison with a sequential one, since there is some inherent overhead in the parallelization. (Ideally, of course, you want to get as close to linear as you can.)

Additionally, there will usually be certain things you need to take care of in a parallel algorithm that you don't need to in a sequential algorithm. Beyond synchronization (which can really bog down your work), there are some other things that can happen:

  • The CPU and OS can't devote all of their time to your application. The OS needs to do context switching every now and again to let other processes get some work done. When you're only using a single core, your process is less likely to be switched out, because the OS has three other cores to choose from for that other work. Note that even though you might think nothing else is running, the OS or some services could still be performing background work.
  • If each of your threads is accessing a lot of data, and this data is not common between threads, you will most likely not be able to store all of this in the CPU cache. That means a lot more memory accessing is required, which is (relatively) slow.

As far as I can tell, your current explicit approach uses a shared iterator between the threads. That's an okay solution if the processing time varies wildly throughout the array, but there is likely to be synchronization overhead to prevent an element from being skipped or processed twice (retrieving the current element and moving the internal pointer to the next element needs to be one atomic operation).

Therefore, it might be a better idea to partition the array, assuming the processing time of each element is expected to be roughly equal regardless of the position of the element. Given that you have 10 million records, that means telling thread 1 to work on elements 0 to 2,499,999, thread 2 works on elements 2,500,000 to 4,999,999, etc. You can assign each thread an ID and use this to calculate the actual range.

Another small improvement would be to let the main thread act as one of the threads that calculates. However, if I remember correctly, that's a very minor thing.
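
A minimal sketch of that range-partitioning idea, assuming the `polygons` array and `evaluator` from the question (the thread count, chunk arithmetic, and running the last slice on the main thread are illustrative, not from the original post):

    // Partition the array into contiguous slices, one per core.
    int threadCount = Environment.ProcessorCount;
    int chunkSize = (polygons.Length + threadCount - 1) / threadCount;
    var threads = new Thread[threadCount - 1];

    for (int i = 0; i < threads.Length; i++)
    {
        int start = i * chunkSize;
        int end = Math.Min(start + chunkSize, polygons.Length);
        threads[i] = new Thread(() =>
        {
            // each thread walks only its own slice; no shared iterator needed
            for (int j = start; j < end; j++)
                evaluator.Evaluate(polygons[j]);
        });
        threads[i].Start();
    }

    // Let the main thread process the final slice instead of just waiting.
    for (int j = (threadCount - 1) * chunkSize; j < polygons.Length; j++)
        evaluator.Evaluate(polygons[j]);

    foreach (var t in threads)
        t.Join();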

Michael Madsen
+5  A: 

Take a look at this article: http://blogs.msdn.com/pfxteam/archive/2008/08/12/8849984.aspx

Specifically, limit memory allocations in the parallel region, and carefully inspect writes to make sure that they don't occur close to memory locations that other threads read or write (false sharing).
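
As one hedged illustration of the "limit allocations in the parallel region" advice (not taken from the linked article), `Parallel.ForEach` has a thread-local-state overload that lets each worker allocate one scratch buffer up front and reuse it for every item; the two-argument `Evaluate` overload below is hypothetical:

    // Each worker gets a single buffer via localInit and reuses it, so no
    // per-item allocations (and no allocator synchronization) occur in the loop.
    Parallel.ForEach(
        polygons,
        () => new long[100],               // localInit: one buffer per worker
        (p, loopState, buffer) =>
        {
            evaluator.Evaluate(p, buffer); // hypothetical overload taking a scratch buffer
            return buffer;                 // hand the buffer back for the next item
        },
        buffer => { });                    // localFinally: nothing to clean up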

Igor ostrovsky
You should also be able to profile this and see what's going on in the concurrency views of the VS 2010 profiler. Here is their team's blog: http://blogs.msdn.com/visualizeparallel/default.aspx
Rick
+2  A: 

So I finally figured out what the problem was - and I think it would be useful to share it with the SO community.

The entire issue with non-linear performance was the result of a single line inside the Evaluate() method:

    var coordMatrix = new long[100];

Since Evaluate() is invoked millions of times, this memory allocation was occurring millions of times. As it happens, the CLR performs some internal inter-thread synchronization when allocating memory - otherwise allocations on multiple threads could inadvertently overlap. Changing the array from a method-local variable to a class-level instance that is allocated only once (but then re-initialized by a loop inside the method) eliminated the scalability problem.

Normally, it's an antipattern to create a class-level member for a variable that is only used (and meaningful) within the scope of a single method. But in this case, since I need the greatest possible scalability, I will live with (and document) this optimization.
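
A minimal before/after sketch of the change described above; only the allocation line was shown in the original, so the surrounding class, the Evaluate signature, and the return type are assumptions:

    class LookupBasedEvaluator
    {
        // Before: allocated on every call, millions of times, which forced the
        // CLR to synchronize allocations across threads:
        //     var coordMatrix = new long[100];   // inside Evaluate()

        // After: allocated once per evaluator instance, so the hot path
        // performs no allocations at all.
        private readonly long[] coordMatrix = new long[100];

        public int Evaluate(PolygonData polygon)
        {
            // method-local re-initialization loop replaces the allocation
            for (int i = 0; i < coordMatrix.Length; i++)
                coordMatrix[i] = 0;

            // ... evaluation logic from the original method ...
            return 0; // placeholder result
        }
    }

    // NOTE: with an instance-level buffer the evaluator is no longer safe to
    // share between threads; each thread needs its own evaluator instance (or
    // some other per-thread isolation of the buffer).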

Epilogue: After I made this change, the concurrent process was able to achieve 12.2 million computations / sec.

P.S. Kudos to Igor Ostrovsky for his germane link to MSDN blogs which helped me identify and diagnose the problem.

LBushkin
You could use stack allocation, which avoids the problem of allocation altogether, although your way works just fine :)
Martin
This could have also been solved by using a resource pool. This question helped me understand why pools can be important when attempting to perform massive parallel operations.
Will
So your original non-parallel implementation runs at 2.4 Mop/s, and your latest parallelised version on 4 cores runs at 12.2 Mop/s. That's super-linear scaling, which is remarkable and worthy of investigation. You did retest the single-core execution of the code after you made the change, didn't you?
High Performance Mark
Changing the memory allocation improved the single-core performance to 3.2 million/sec, so the 12.2 million/sec four-core result is reasonable.
LBushkin
A: 

I've asked a similar question here titled "Why doesn’t my threaded .Net app scale linearly when allocating large amounts of memory?"

http://stackoverflow.com/questions/2072752/why-doesnt-my-threaded-net-app-scale-linearly-when-allocating-large-amounts-of

Luke Venediger