
I've been tasked with taking an existing single-threaded Monte Carlo simulation and optimising it. It's a C# console app with no DB access: it loads data once from a CSV file and writes results out at the end, so it's pretty much purely CPU bound, and it only uses about 50 MB of memory.

I've run it through the JetBrains dotTrace profiler. Of total execution time, about 30% is spent generating uniform random numbers and 24% translating uniform random numbers into normally distributed random numbers.

The basic algorithm is a whole lot of nested for loops, with random number calls and matrix multiplication at the centre. Each iteration returns a double which is added to a results list; this list is periodically sorted and tested against some convergence criteria (at checkpoints every 5% of the total iteration count). If the criteria are met, the program breaks out of the loops and writes the results; otherwise it proceeds to the end.
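For concreteness, the loop shape described above might look roughly like the sketch below. This is a hypothetical reconstruction, not the actual application code: the names Simulate and Converged, the seed, and the 5% checkpoint interval are all assumptions.

```csharp
using System;
using System.Collections.Generic;

class LoopSketch
{
    const int TotalIterations = 500000;

    static void Main()
    {
        var rnd = new Random(42);
        var results = new List<double>();
        int checkpoint = TotalIterations / 20; // a convergence check every 5%

        for (int i = 1; i <= TotalIterations; i++)
        {
            results.Add(Simulate(rnd)); // RNG calls + matrix multiply in the real app
            if (i % checkpoint == 0)
            {
                results.Sort();
                if (Converged(results))
                    break; // early exit: write results and stop
            }
        }
    }

    // Placeholder stand-ins for the real simulation and convergence test.
    static double Simulate(Random rnd) { return rnd.NextDouble(); }
    static bool Converged(List<double> r) { return false; }
}
```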

I'd like developers to weigh in on:

  • should I use new Thread vs. the ThreadPool?
  • should I look at the Microsoft Parallel Extensions library?
  • should I look at AForge.Net Parallel.For (http://code.google.com/p/aforge/)? Any other libraries?

Some links to tutorials on the above would be most welcome as I've never written any parallel or multi-threaded code.

  • best strategies for generating normally distributed random numbers en masse, and then consuming them. Uniform random numbers are never used in this state by the app; they are always translated to normally distributed numbers and then consumed.
  • good fast libraries (parallel?) for random number generation
  • memory considerations as I take this parallel: how much extra will I require?

The current app takes 2 hours for 500,000 iterations. The business needs this to scale to 3,000,000 iterations and be called multiple times a day, so some heavy optimisation is needed.

I'd particularly like to hear from people who have used Microsoft Parallel Extensions or AForge.Net Parallel.

This needs to be productionised fairly quickly, so the .NET 4 beta is out even though I know it has concurrency libraries baked in; we can look at migrating to .NET 4 later down the track once it's released. For the moment the server has .NET 2.0; I've submitted for review an upgrade to .NET 3.5 SP1, which my dev box has.

Thanks

Update

I've just tried the Parallel.For implementation, but it comes up with some weird results. Single threaded:

IRandomGenerator rnd = new MersenneTwister();
IDistribution dist = new DiscreteNormalDistribution(discreteNormalDistributionSize);
List<double> results = new List<double>();

for (int i = 0; i < CHECKPOINTS; i++)
{
 results.AddRange(Oblist.Simulate(rnd, dist, n));
}

To:

Parallel.For(0, CHECKPOINTS, i =>
        {
           results.AddRange(Oblist.Simulate(rnd, dist, n));
        });

Inside Simulate there are many calls to rnd.NextUniform(). I think I am getting many values that are the same; is this likely to happen because this is now parallel?

There may also be issues with the List<double>.AddRange call not being thread safe. I see that System.Threading.Collections.BlockingCollection might be worth using, but it only has an Add method, no AddRange, so I'd have to loop over the results and add them in a thread-safe manner. Any insight from someone who has used Parallel.For would be much appreciated. I switched to System.Random for my calls temporarily, as I was getting an exception when calling NextUniform with my Mersenne Twister implementation; perhaps it wasn't thread safe, as a certain array was getting an index out of bounds.

+6  A: 

First you need to understand why you think that using multiple threads is an optimization - when it is, in fact, not. Using multiple threads will make your workload complete faster only if you have multiple processors, and then at most as many times faster as you have CPUs available (this is called the speed-up). The work is not "optimized" in the traditional sense of the word (i.e. the amount of work isn't reduced - in fact, with multithreading, the total amount of work typically grows because of the threading overhead).

So in designing your application, you have to find pieces of work that can be done in a parallel or overlapping fashion. It may be possible to generate random numbers in parallel (by having multiple RNGs run on different CPUs), but that would also change the results, as you would get different random numbers. Another option is to have generation of the random numbers on one CPU, and everything else on different CPUs. This can give you a maximum speedup of 3, as the RNG will still run sequentially and still take 30% of the load.

So if you go for this parallelization, you end up with 3 threads: thread 1 runs the RNG, thread 2 produces normal distribution, and thread 3 does the rest of the simulation.

For this design, a producer-consumer architecture is most appropriate. Each thread will read its input from a queue and produce its output into another queue. Each queue should be blocking, so if the RNG thread falls behind, the normalization thread will automatically block until new random numbers are available. For efficiency, I would pass the random numbers across threads in arrays of, say, 100 (or larger), to avoid synchronization on every random number.

For this approach, you don't need any advanced threading. Just use the regular Thread class: no pool, no library. The only thing you need that is (unfortunately) not in the standard library is a blocking queue class (the Queue class in System.Collections is no good). CodeProject provides a reasonable-looking implementation of one; there are probably others.
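A minimal blocking queue along these lines can be built with Monitor.Wait/Pulse on .NET 2.0/3.5, which predate BlockingCollection. This is a sketch, not a production implementation (no shutdown signal, no capacity bound); batches of random numbers are passed as arrays to cut synchronization cost, as suggested above.

```csharp
using System.Collections.Generic;
using System.Threading;

// Minimal blocking queue sketch for .NET 2.0/3.5.
public class BlockingQueue<T>
{
    private readonly Queue<T> queue = new Queue<T>();
    private readonly object sync = new object();

    public void Enqueue(T item)
    {
        lock (sync)
        {
            queue.Enqueue(item);
            Monitor.Pulse(sync); // wake one waiting consumer
        }
    }

    public T Dequeue()
    {
        lock (sync)
        {
            while (queue.Count == 0)
                Monitor.Wait(sync); // block until a producer enqueues
            return queue.Dequeue();
        }
    }
}
```

A producer thread would then push batches, e.g. `uniforms.Enqueue(GenerateBatch(rnd, 100))` in a loop, while the normalization thread calls `uniforms.Dequeue()` (GenerateBatch here is a hypothetical helper wrapping your RNG).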

Martin v. Löwis
The other issue to consider is context switching. If you didn't choose the architecture above (probably a mistake, IMO, from what you have said), then you would be trying to run a lot of calculations in parallel, far exceeding your number of processors. This would be disastrous, as a lot of processor time that was previously spent calculating answers would now be devoted to switching between threads. If you had some file I/O after each calculation, then perhaps that could be done asynchronously (but then you would use a queue and pass items to store to a dedicated component).
John Nicholas
The Monte Carlo calculation is entirely CPU bound, so are you saying I should always map 1 thread to 1 CPU on the box, and there is never an advantage to going > 1 thread per CPU? Unless one thread is waiting on something else, which would allow efficiencies in the context switches; but in my case I think there is no advantage, in fact it would be worse performance.
m3ntat
Correct. If there is really no IO in these threads, then using multiple threads per CPU will slow it down, not speed it up.
Martin v. Löwis
Ok, thanks. What is the difference between a core and a CPU? Also, does hyper-threading make any difference? I've been developing and profiling this on my own machine, an Intel Core 2 Duo E6550 @ 2.33 GHz (in Device Manager it shows as 2 processors). The server is an AMD Opteron 275 (in Device Manager it shows as 4 processors). In C#, if I do Environment.ProcessorCount and start that number of threads, is that the way to go? Also, if I were to propose new hardware for this mission-critical app at work, what should I be considering? Thanks
m3ntat
Also, my PC is Windows XP and the server is Windows Server 2003 32-bit. I have just read about the Windows HPC OS and am wondering if it's worth it for this migration. Also, if this app is entirely CPU bound, not memory bound, is there likely to be any speedup going to x64? Will functions like Math.Sqrt, matrix multiplication, and division be faster on x64?
m3ntat
You should consider CPU, core, and processor as synonyms, and only count the total of cores (i.e. two CPU chips with each two cores are the same as one CPU with four cores, roughly). Windows displays each core (appropriately) as a CPU. With HyperThreading, each core may show up as two CPUs; disable HT if that happens. Environment.ProcessorCount should give you the same number as the task manager. Ignore Windows HPC: it is useful only for clusters of several server computers. x64: this really depends on the application. For floating-point operations, there should be no difference.
Martin v. Löwis
Another remark: don't try to overengineer this application. It really sounds like a standard multi-threading use case with a few separable jobs, so the really old, established approaches will work very well. Get that to work, and a speedup of three (on four CPUs) would be excellent.
Martin v. Löwis
A: 

Threading is going to be complicated. You will have to break your program into logical units that can each be run on their own threads, and you will have to deal with any concurrency issues that emerge.

The Parallel Extension Library should allow you to parallelize your program by changing some of your for loops to Parallel.For loops. If you want to see how this works, Anders Hejlsberg and Joe Duffy provide a good introduction in their 30 minute video here:

http://channel9.msdn.com/shows/Going+Deep/Programming-in-the-Age-of-Concurrency-Anders-Hejlsberg-and-Joe-Duffy-Concurrent-Programming-with/

Threading vs. ThreadPool

The ThreadPool, as its name implies, is a pool of threads. Using the ThreadPool to obtain your threads has some advantages. Thread pooling enables you to use threads more efficiently by providing your application with a pool of worker threads that are managed by the system.
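As a minimal illustration of the ThreadPool route (a sketch, not the simulation itself), a work item can be queued and waited on with a ManualResetEvent:

```csharp
using System;
using System.Threading;

class ThreadPoolDemo
{
    static void Main()
    {
        using (var done = new ManualResetEvent(false))
        {
            ThreadPool.QueueUserWorkItem(state =>
            {
                Console.WriteLine("running on a pool thread");
                done.Set(); // signal completion to the main thread
            });
            done.WaitOne(); // block until the work item has run
        }
    }
}
```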

Robert Harvey
Hmm, I don't think using the ThreadPool will be more complicated than manual threading (which I think is something you meant to say but left out?). Comparing the ThreadPool to manually handling Threads, the ThreadPool is more efficient (as it recycles completed threads; thread creation is expensive) and is a bit easier to work with, especially if leveraging delegates. That said, I can't speak to how it compares to the Parallel libraries; I just didn't want the ThreadPool to get a bad name :-)
STW
+1  A: 

List<double> is definitely not thread-safe. See the section "thread safety" in the System.Collections.Generic.List documentation. The reason is performance: adding thread safety is not free.

Your random number implementation also isn't thread-safe; getting the same numbers multiple times is exactly what you'd expect in this case. Let's use the following simplified model of rnd.NextUniform() to understand what is happening:

  1. calculate pseudo-random number from the current state of the object
  2. update state of the object so the next call yields a different number
  3. return the pseudo-random number

Now, if two threads execute this method in parallel, something like this may happen:

  • Thread A calculates a random number as in step 1.
  • Thread B calculates a random number as in step 1. Thread A has not yet updated the state of the object, so the result is the same.
  • Thread A updates the state of the object as in step 2.
  • Thread B updates the state of the object as in step 2, trampling over A's state changes or maybe giving the same result.

As you can see, any reasoning you can do to prove that rnd.NextUniform() works is no longer valid because two threads are interfering with each other. Worse, bugs like this depend on timing and may appear only rarely as "glitches" under certain workloads or on certain systems. Debugging nightmare!

One possible solution is to eliminate the state sharing: give each task its own random number generator initialized with a different seed (assuming that the instances are not sharing state through static fields in some way).
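A sketch of that "one generator per task" approach, applied to the Parallel.For from the question, might look like this. The MersenneTwister(seed) constructor and the Oblist.Simulate signature are assumptions based on the question; the key point is that no two tasks ever share generator state, and the only lock protects the result aggregation.

```csharp
// Create one generator per checkpoint iteration up front, each with a
// distinct seed drawn from a single master RNG (the seeding scheme is an
// assumption; any scheme producing distinct seeds works).
var master = new Random(12345);
var generators = new IRandomGenerator[CHECKPOINTS];
for (int i = 0; i < CHECKPOINTS; i++)
    generators[i] = new MersenneTwister(master.Next());

var results = new List<double>();
Parallel.For(0, CHECKPOINTS, i =>
{
    List<double> partial = Oblist.Simulate(generators[i], dist, n);
    lock (results)                 // List<double> is not thread-safe
        results.AddRange(partial);
});
```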

Another (inferior) solution is to create a field holding a lock object in your MersenneTwister class like this:

private readonly object lockObject = new object();

Then use this lock in your MersenneTwister.NextUniform() implementation:

public double NextUniform()
{
   lock(lockObject)
   {
      // original code here
   }
}

This will prevent two threads from executing the NextUniform() method in parallel. The problem with the list in your Parallel.For can be addressed in a similar manner: separate the Simulate call and the AddRange call, and then add locking around the AddRange call.

My recommendation: avoid sharing any mutable state (like the RNG state) between parallel tasks if at all possible. If no mutable state is shared, no threading issues occur. This also avoids locking bottlenecks: you don't want your "parallel" tasks to wait on a single random number generator that doesn't work in parallel at all, especially if 30% of the time is spent acquiring random numbers.

Limit state sharing and locking to places where you can't avoid it, like when aggregating the results of parallel execution (as in your AddRange calls).

Wim Coenen
Fantastic response, thanks! That makes perfect sense. Now the question is: should I use AddRange, or find a thread-safe collection that lets me accumulate the list of random numbers (doubles)? Add order is unimportant, but I do need to periodically sort the results, grab a result at a certain percentile, and check convergence criteria to test for early termination of the simulation. I need to do this for each Parallel.For path running, and then cancel all parallel executions immediately if the criteria are met, as no further processing is required. Any idea how to do that?
m3ntat
I don't immediately have an answer for that. Periodic status checks and canceling of pending/running parallel tasks is a big topic on its own; I recommend you post a new question.
Wim Coenen
Although, check out http://blogs.msdn.com/pfxteam/archive/2009/05/22/9635790.aspx
Wim Coenen