views:

140

answers:

4

Hi,

I've recently wrote a C# console time tabling algorithm that is based on a combination of a genetic algorithm with a few brute force routines thrown in. The initial results were promising but I figured I could improve the performance by splitting the brute force routines up to run in parallel on multi processor architectures. To do this I used the well documented Producer/Consumer model (as documented in this fantastic article http://www.albahari.com/threading/part2.aspx#_ProducerConsumerQWaitHandle). I changed my code to create one thread per logical processor during the brute force routines.

The performance gains on my work station were very pleasing. I am running Windows XP on the following hardware:

Intel Core 2 Quad CPU 2.33 GHz 3.49 GB RAM

Initial tests indicated average performance gains of approx 40% when using 4 threads. The next step was to deploy the new multi-threading version of the algorithm to our higher spec UAT server. Here is the spec of our UAT server:

Windows 2003 Server R2 Enterprise x64 8 cpu (Quad-Core) AMD Opteron 2.70 GHz 255 GB RAM

After running the first round of tests we were all extremely surprised to find that the algorithm actually runs slower on the high spec W2003 server than on my local XP work station! In fact the tests seem to indicate that it doesn't matter how many threads are generated (tests were ran with the app spawning between 2 to 32 threads). The algorithm always runs significantly slower on the UAT W2003 server?

How could this be? Surely the app should run faster on a 8 cpu (Quad-Core) than my 2 Quad work station? Why are we seeing no performance gains with the multi-threading on the W2003 server whilst the XP workstation tests show gains of up to 40%?

Any help or pointers would be appreciated.

Regards

Myles

+1  A: 

You need to find out where it's spending its time. Could it be something silly like very slow console writes?

It sounds like you're changing between an x86 and x64 platform too, but you don't say how your .NET app is compiled - is it running as 32- or 64-bit on the x64 machine?

Will Dean
I compiled a x64 version for the w2003 server. It made no difference in performance.
Myles J
+1  A: 

It very much depends on your code and the OS. It is impossible to answer your Q without examining code. It is easy to get multi-threading wrong.

Hamish Grubijan
A: 

A 40% total speedup implies that either your algorithm is memory bandwidth constrained, or you are doing far too much synchronization. A profiler can help in each case.

Each call to wait for more data to process is expensive. Ideally, the amount of CPU time spent waiting for new data or performing the synchronized locks/unlocks is tiny. The simple way of ensuring this is making your processing payloads as "large" as possible.

As for the slowdown on your production system - profile it. There are numerous variables here.

Yann Ramin
How did you come to the conclusion that it is memory bound? or cache-line bound?
GrayWizardx
Barring synchronization overhead, a cache-friendly algorithm will scale near 1:1 with the number of processors available. If the algorithm is memory bandwidth bound (or cache unfriendly), the total gain per processor will plateau early on. The odd part is the Opteron generally has larger memory bandwidth available in a multi-processor situation due to NUMA support.
Yann Ramin
+1  A: 

My guess, (which is limited given the lack of information) is that you may be experiencing problems due to true sharing, or more likely, false sharing.

False sharing can easy cause algorithms to slow down as more cores are added, due to the excessive cache hits. If your server has a larger cache line size, this makes it more likely to occur.

I, in particular, suspect this may be the problem - particularly because you're only getting a 40% boost on 4 threads vs. 1. Often, you'll get a certain amount of scalability up to a low threshold of threads, then start getting cache hit misses that cause the perf. to drop dramatically. This may be the issue.

Reed Copsey
+1 for commenting that more threads =/= faster/better performance. :)
GrayWizardx
This is the most feasible explanation that I've read so far. Any suggestions on profiling tools/techniques to prove this?Many thanks
Myles J
I've just read some articles about false sharing and cache line sizes. I quote:"Generally, the solution is to pack data tightly, give each thread its own private copy to work on, and merge results afterwards. This strategy extends naturally to task stealing. When a thread steals a task, it can clone the shared data structures that might cause cache line ping ponging, and merge the results later."This is the approach that I am already using. Each spawned thread has its own copy of cloned data. The brute force results are merged back when the thread completes execution.
Myles J
If that's the case, it very well may be the synchronization during the "merge" and "split" operations. Try making each workload as large as possible. VS2010's profiler is amazing at detecting this sort of situation, btw - and free ATM. Otherwise, tools like V-Tune (Intel) work very well - and even basic code profilers.
Reed Copsey
Also - make sure your copy of the data is allocated on the thread that will use it. If you allocate each copy on teh main thread, it's very easy to make the accidental mistake of having cache lines next to each other - also, watch for arrays - any array that is written to on multiple threads will cause a cache line hit, due to the bounds checking...
Reed Copsey
I'm allocating cloned data objects to a queue that will be accessed by the spawned consumer threads. This occurs on the main thread.Is this what you meant by "make sure your copy of the data is allocated on the thread that will use it"?
Myles J
Depending on the size and implementation of your queue, and which thread is doing the copying, it's possible that you get false sharing in your data packets. Make sure each thread does it's own copying (ie: copy on the consumer, not producer side), as that tends to help. Locking on the queue access/data sharing/merging may be killing you, too. VS2010 is GREAT for tracking down these sort of issues.. I'd try grabbing a copy (it's free ATM) and trying the concurrency profiler out.
Reed Copsey