I've encountered a strange behavior in a .NET application that performs some highly parallel processing on a set of in-memory data.
When run on a multi-core processor (Intel Core 2 Quad Q6600 @ 2.4 GHz) it exhibits sub-linear scaling as multiple threads are kicked off to process the data.
When run as a non-multithreaded loop on a single core, the process completes approximately 2.4 million computations per second. With four threads you would expect roughly four times the throughput - somewhere in the neighborhood of 9.6 million computations per second - but alas, no. In practice it only completes about 4.1 million per second ... quite a bit short of the expected throughput.
Furthermore, the behavior is the same whether I use PLINQ, a thread pool, or four explicitly created threads. Quite strange...
Nothing else on the machine is using CPU time, and there are no locks or other synchronization objects involved in the computation ... it should just tear ahead through the data. I've confirmed this (to the extent possible) by watching perfmon while the process runs ... it reports no thread contention and no garbage collection activity.
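For what it's worth, the absence of GC activity can also be checked in-process rather than through perfmon, by sampling GC.CollectionCount before and after the hot loop. A minimal sketch (the summing loop is just a stand-in for the real evaluation loop and allocates nothing):

```csharp
using System;

class GcActivityCheck
{
    static void Main()
    {
        // sample collection counts before the workload
        int gen0 = GC.CollectionCount(0);
        int gen2 = GC.CollectionCount(2);

        // stand-in for the hot evaluation loop; performs no allocations
        long sum = 0;
        for (int i = 0; i < 10000000; i++) sum += i;

        // any increase here means collections ran during the workload
        Console.WriteLine("gen0 collections: " + (GC.CollectionCount(0) - gen0));
        Console.WriteLine("gen2 collections: " + (GC.CollectionCount(2) - gen2));
        Console.WriteLine("checksum: " + sum);
    }
}
```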
My theories at the moment:
- The overhead of all of the techniques (thread context switches, etc) is overwhelming the computations
- The threads are not getting assigned to each of the four cores and spend some time waiting on the same processor core ... not sure how to test this theory...
- .NET CLR threads are not running at the priority expected or have some hidden internal overhead.
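Regarding the second theory: one way I can imagine testing it is to pin each worker thread to a distinct core and re-run the benchmark - if throughput jumps, scheduler placement was the culprit. A minimal Windows-only sketch (it P/Invokes SetThreadAffinityMask from kernel32; the per-thread evaluation loop itself is elided):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

class AffinityProbe
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll")]
    static extern UIntPtr SetThreadAffinityMask(IntPtr hThread, UIntPtr mask);

    static void Main()
    {
        var threads = new Thread[4];
        for (int i = 0; i < 4; i++)
        {
            int core = i;
            threads[i] = new Thread(() =>
            {
                // keep the managed thread bound to one OS thread,
                // then pin that OS thread to a single core
                Thread.BeginThreadAffinity();
                SetThreadAffinityMask(GetCurrentThread(), (UIntPtr)(1u << core));
                // ... run this thread's share of the evaluation loop here ...
                Thread.EndThreadAffinity();
            });
            threads[i].Start();
        }
        foreach (var t in threads) t.Join();
        Console.WriteLine("all pinned threads finished");
    }
}
```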
Below is a representative excerpt from the code that should exhibit the same behavior:
var evaluator = new LookupBasedEvaluator();
// find all ten-vertex polygons that are a subset of the set of points
var ssg = new SubsetGenerator<PolygonData>(Points.All, 10);
const int TEST_SIZE = 10000000; // evaluate the first 10 million records
// materialize the data into memory...
var polygons = ssg.AsParallel()
.Take(TEST_SIZE)
.Cast<PolygonData>()
.ToArray();
var sw1 = Stopwatch.StartNew();
// foreach loop completes in about 4.02 seconds ... ~ 2.483 million/sec
foreach( var polygon in polygons )
evaluator.Evaluate(polygon);
sw1.Stop();
Console.WriteLine( "Linear, single core loop: {0}", sw1.ElapsedMilliseconds );
// now attempt the same thing in parallel using Parallel.ForEach...
// MS documentation indicates this internally uses a worker thread pool
// completes in 2.61 seconds ... or ~ 3.831 million/sec
var sw2 = Stopwatch.StartNew();
Parallel.ForEach(polygons, p => evaluator.Evaluate(p));
sw2.Stop();
Console.WriteLine( "Parallel.ForEach() loop: {0}", sw2.ElapsedMilliseconds );
// now using PLINQ, we get slightly better results, but not by much
// completes in 2.21 seconds ... or ~ 4.524 million/second
var sw3 = Stopwatch.StartNew();
polygons.AsParallel()
        .WithDegreeOfParallelism(Environment.ProcessorCount)
        .AsUnordered() // not sure this is necessary...
        .ForAll( p => evaluator.Evaluate(p) );
sw3.Stop();
Console.WriteLine( "PLINQ.AsParallel.ForAll: {0}", sw3.ElapsedMilliseconds );
// now using four explicit threads:
// best, still short of expectations at 1.99 seconds = ~ 5 million/sec
ParameterizedThreadStart tsd = delegate(object pset) {
    foreach (var p in (IEnumerable<PolygonData>) pset)
        evaluator.Evaluate(p);
};
var t1 = new Thread(tsd);
var t2 = new Thread(tsd);
var t3 = new Thread(tsd);
var t4 = new Thread(tsd);
var sw4 = Stopwatch.StartNew();
// give each thread a quarter of the data
t1.Start(polygons.Take(TEST_SIZE / 4));
t2.Start(polygons.Skip(TEST_SIZE / 4).Take(TEST_SIZE / 4));
t3.Start(polygons.Skip(TEST_SIZE / 2).Take(TEST_SIZE / 4));
t4.Start(polygons.Skip(3 * TEST_SIZE / 4).Take(TEST_SIZE / 4));
t1.Join();
t2.Join();
t3.Join();
t4.Join();
sw4.Stop();
Console.WriteLine( "Four Explicit Threads: {0}", sw4.ElapsedMilliseconds );