I have a C# console app, a Monte Carlo simulation that is entirely CPU bound; execution time is inversely proportional to the number of dedicated threads/cores available (I keep a 1:1 ratio between cores and threads).

It currently runs daily on:

AMD Opteron 275 @ 2.21 GHz (4 core)

The app is multithreaded, using 3 threads; the 4th thread is for another Process Controller app.

It takes 15 hours per day to run.

I need to estimate, as best I can, how long the same work would take to run on a system configured with the following CPUs:

http://en.wikipedia.org/wiki/Intel_Nehalem_(microarchitecture)
2 x X5570
2 x X5540

and compare the cases; I will recode the app to use all available threads. I want to justify that we need a server with 2 x X5570 CPUs over the cheaper X5540s (both support 2 CPUs on a single motherboard). This should make 8 cores, 16 threads (that's how the Nehalem chips work, I believe) available to the operating system, so for my app that's 15 threads for the Monte Carlo simulation.

Any ideas how to do this? Is there a website where I can see single-threaded benchmark data for all 3 CPUs involved? I can then extrapolate for my case and number of threads. I have access to the current system to install and run a benchmark on if necessary.

Note the business is also dictating that the workload for this app will increase about 20-fold over the next 3 months, and it needs to complete within a 24-hour window.

Any help much appreciated.

I have also posted this here: http://www.passmark.com/forum/showthread.php?t=2308 in the hope they can better explain their benchmarking, so I can effectively get a score per core, which would be much more helpful.

A: 

I'm going to go out on a limb and say that even the dual-socket X5570 will not be able to scale to the workload you envision. You need to distribute your computation across multiple systems. Simple math:

Current Workload

3 cores * 15 real-world-hours = 45 cpu-time-hours

Proposed 20X Workload

45 cpu-time-hours * 20 = 900 cpu-time-hours
900 cpu-time-hours / (20 hours-per-day-per-core) = 45 cores

Thus, you would need the equivalent of 45 2.2 GHz Opteron cores to achieve your goal (even after increasing the daily processing window from 15 hours to 20 hours), assuming completely linear scaling of performance. Even if the Nehalem CPUs are 3x faster per thread, you will still be at the outside edge of your performance envelope, with no room to grow. That also assumes that hyper-threading will even work for your application.
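The arithmetic above can be packaged as a tiny back-of-envelope model (in Python rather than C#, purely illustrative; it assumes perfectly linear scaling, and the per-core speedup factor is something you would have to measure):

```python
def cores_needed(threads, wall_hours, growth, window_hours, speedup=1.0):
    """Equivalent Opteron-class cores needed to finish the grown workload
    within the daily window, assuming perfectly linear scaling."""
    cpu_hours = threads * wall_hours * growth   # total single-core work
    return cpu_hours / (window_hours * speedup)

# Current: 3 threads * 15 h = 45 cpu-hours; 20x growth; 20 h/day budget.
print(cores_needed(3, 15, 20, 20))                # 45.0
print(cores_needed(3, 15, 20, 20, speedup=3.0))   # 15.0 if Nehalem were 3x faster
```

Plugging in different speedup guesses shows how sensitive the conclusion is to that one unmeasured number.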

The best-case estimates I've seen would put the X5570 at perhaps 2X the performance of your existing Opteron.

Source: http://www.dailytech.com/Server+roundup+Intel+Nehalem+Xeon+versus+AMD+Shanghai+Opteron/article15036.htm

Will Bickford
No, the current workload is 3 cores for a total of 15 hours. If there were 1 core it would take 45 hours.
m3ntat
That's what I'm saying - your total workload is currently 45 hours worth of 1 core.
Will Bickford
Yeah, sorry, I misinterpreted you. I agree there are roughly 900 hours' worth of single-threaded work to complete. If I can get 15 threads running this app, that's 900/15 = 60 hours of work on the current hardware. If Nehalem is 3x faster per thread, that's 20 hours of work per day, which is getting really high. I really need to know what that multiplier is. Is it 3x, and how do I find out?
m3ntat
+2  A: 

Have you considered recreating the algorithm in CUDA? It uses current-day GPUs to speed up calculations like these 10-100 fold. That way you just need to buy a fat video card.

Toad
Recoding for CUDA is not an option at this stage, unless I can run C# code easily under CUDA?
m3ntat
CUDA is based on C. I don't know how big your algorithm is, but it might be worth the trouble to port it. A speedup of a factor of 10-50 seems to me a huge incentive.
Toad
A: 

tomshardware.com contains a comprehensive list of CPU benchmarks. However, you can't just divide the scores: you need as close to an apples-to-apples comparison as you can get, and you won't quite get one, because how well a benchmark predicts your workload depends on its instruction mix matching yours.

I would guess (please don't take this as official; you need real data for this) that you're probably looking at a 1.5x-1.75x single-threaded speedup, if the work is CPU bound and not highly vectorized.

You also need to take into account that: 1) you are using C# and the CLR, so unless you've taken steps to prevent it, the GC may kick in and serialize you; 2) the Nehalems have hyper-threading, so you won't see a perfect 16x speedup; more likely you'll see an 8x to 12x speedup depending on how optimized your code is (be optimistic here, just don't expect 16x); 3) I don't know how much contention you have; good scaling on 3 threads != good scaling on 16 threads, and there may be dragons here (there usually are).

I would envelope calc this as:

15 hours * 3 threads / 1.5x = 30 hours of single-threaded work time on a Nehalem.

30 / 12 = 2.5 hours (best case)

30 / 8 = 3.75 hours (worst case)

Which implies, if there is truly a 20x increase, a parallel run time of: 2.5 hours * 20 = 50 hours (best case)

3.75 hours * 20 = 75 hours (worst case)
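The envelope calc above as a runnable sketch (in Python rather than C#; the 1.5x single-thread speedup and the 8x-12x scaling range are the guesses from the text, not measurements):

```python
# Total single-threaded Nehalem work, using the guessed 1.5x speedup.
single_thread_hours = 15 * 3 / 1.5      # 30.0 hours

best = single_thread_hours / 12         # 2.5 h  (good hyper-thread scaling)
worst = single_thread_hours / 8         # 3.75 h (poor scaling)

# With the 20x workload increase:
print(best * 20, worst * 20)            # 50.0 75.0 hours per day of work
```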

How much have you profiled? Can you squeeze 2x out of the app? One server may be enough, but likely won't be.

And for gosh sakes, try out the Task Parallel Library in .NET 4.0 or the .NET 3.5 CTP; it's supposed to help with this sort of thing.
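The n-1 throttling idea from the question is language-agnostic. A minimal sketch (in Python rather than C#, purely illustrative, with `simulate` as a hypothetical stand-in for one Monte Carlo batch) of capping a worker pool at cores minus one so the controller app keeps a core:

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    # Hypothetical stand-in for one Monte Carlo batch.
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(1000))

# Cap the pool at n-1 workers so one core stays free for other apps.
workers = max(1, (os.cpu_count() or 1) - 1)
with ThreadPoolExecutor(max_workers=workers) as pool:
    results = list(pool.map(simulate, range(32)))
print(len(results))   # 32 batches completed
```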

-Rick

Rick
Thanks Rick, I've tried out the Parallel library; yes, it is good and it works, but it doesn't allow me to throttle to a max number of threads, which I need because there are other apps on this box that must run without becoming too sluggish. That's why I will make n-1 threads available to this app, most likely 15 threads (best case, as you say).
m3ntat
I've been through the app with a fine-tooth comb using Red Gate profiling, dotTrace, IBM Purify/Quantify etc. and optimised everything software-wise, including my Mersenne Twister. As much as can be done in this area has been done. It is now down to hardware.
m3ntat
I've used the X5570; the best speedup I've seen is 12x, for non-SSE code. Keep in mind that because it is hyper-threaded, if you optimize it *more* and the instruction pipeline is full, the speedup will go down towards 8x. There is usually an opportunity to vectorize Monte Carlos via SSE, which has the potential for another massive perf win (4x per core), but that would imply moving outside of C#, at least for the kernel.
Rick
Hi @Rick, what do you mean by vectorize via SSE? Can I do this in C#? Can you provide a link to an article or tutorial? Thanks.
m3ntat
A: 

Finding a single-box server which can scale according to the needs you've described is going to be difficult. I would recommend looking at Sun CoolThreads or other high-thread count servers even if their individual clock speeds are lower. http://www.sun.com/servers/coolthreads/overview/performance.jsp

The T5240 supports 128 threads: http://www.sun.com/servers/coolthreads/t5240/index.xml

Memory and CPU cache bandwidth may be a limiting factor for you if the datasets are as large as they sound. How much time is spent getting data from disk? Would massively increased RAM sizes and caches help?

You might want to step back and see if there is a different algorithm which can provide the same or similar solutions with fewer calculations.

It sounds like you've spent a lot of time optimizing the calculation thread, but is every calculation being performed actually important to the final result?

Is there a way to shortcut calculations anywhere?

Is there a way to identify items which have negligible effects on the end result, and skip those calculations?

Can a lower resolution model be used for early iterations with detail added in progressive iterations?

Monte Carlo algorithms I am familiar with are non-deterministic, and run time would be related to the number of samples; is there any way to optimize the sampling model to limit the number of items examined?
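On the sampling point: Monte Carlo standard error typically shrinks as 1/sqrt(N), so relaxing the required precision reduces the sample count quadratically. A small illustration (estimating pi, purely as a stand-in for the real model):

```python
import random

def estimate_pi(n, rng):
    """Classic Monte Carlo pi estimate: fraction of random points in the
    unit square that fall inside the quarter circle, times 4."""
    hits = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))
    return 4.0 * hits / n

rng = random.Random(42)
for n in (1_000, 4_000, 16_000):
    # Each 4x increase in samples only roughly halves the error.
    print(n, estimate_pi(n, rng))
```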

Obviously I don't know what problem domain or data set you are processing, but there may be another approach which can yield equivalent results.

ryandenki
A: 

It'd be swinging a big hammer, but perhaps it makes sense to look at some heavy-iron 4-way servers. They are expensive, but at least you could get up to 24 physical cores in a single box. If you've exhausted all other means of optimization (including SIMD), then it's something to consider.

I'd also be wary of other bottlenecks, such as memory bandwidth. I don't know the performance characteristics of Monte Carlo simulations, but ramping up one resource might reveal some other bottleneck.

exabytes18