I'm performing an operation, let's call it CalculateSomeData. CalculateSomeData operates in successive "generations", numbered 1..x. The number of generations in the entire run is fixed by the input parameters to CalculateSomeData and is known a priori. A single generation takes anywhere from 30 minutes to 2 hours to complete. Some of that variability is due to the input parameters and cannot be controlled. However, a portion of it is due to things like hardware capacities, CPU load from other processes, network bandwidth load, etc. One parameter that can be controlled per generation is the number of threads that CalculateSomeData uses. Right now that's fixed and likely non-optimal. I'd like to track the time each generation takes and then have some algorithm by which I tweak the number of threads so that each successive generation improves upon the prior generation's calculation time (minimizing time). What approach should I use? How applicable are genetic algorithms? Intuition tells me that the range is going to be fairly tight - maybe 1 to 16 threads on a dual quad-core processor machine.

Any pointers, pseudocode, etc. are much appreciated.

+2  A: 

If the calculations are completely CPU-bound, the number of threads should equal the number of cores on the machine. That way you minimize the number of context switches.

If your calculations involve I/O, network, synchronization, or anything else that blocks execution, you must find the limiting resource and measure its utilization. Monitor the utilization and slowly add threads until it gets close to 100%. You want as few threads as possible while still saturating your limiting resource.
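A minimal sketch of that starting point (Java is assumed here; `calculateSomeData` is a placeholder for your actual work unit, not a real API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CpuBoundStartingPoint {
    public static void main(String[] args) {
        // For purely CPU-bound work, one worker thread per core is the usual starting point.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Submit your per-generation work units here, e.g.:
        // pool.submit(() -> calculateSomeData(params));

        pool.shutdown();
    }
}
```

From there you measure utilization of the limiting resource and nudge the pool size up or down between runs.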

Albin Sunnanbo
Hi Albin - This seems quite complicated compared to just measuring the time a generation takes and minimizing that through an optimization algorithm. The calculations are bound by CPU, disk, and network, and it's difficult for me to say what the proportion is or how it changes over time. But I know one thing for sure - the faster a generation runs the better...so shouldn't the solution simply be tied to generation time?
SFun28
@SFun28 - What happens if you run 2 threads for each core, do you get ~100% CPU then?
Albin Sunnanbo
Another thing to remember is that you might not be the only user of the resource. In fact, if you're not doing HPC, this is almost certainly the case. Life's simpler if you make the number of threads to use into something that can be manually tuned (though the likes of Apple's Grand Central stuff holds out the prospect of being able to do better at auto-tuning on a shared system than was possible before…)
Donal Fellows
@Donal Fellows - I'd prefer to avoid manual tuning. I could be sitting there for hours or days tweaking it. Surely a simple algorithm would do a more efficient job than I would at picking the right # of threads.
SFun28
@SFun28: Autotuning is harder than it looks, and you can usually take a good first hack at tuning by hand using back-of-envelope calculations. Then you measure, see whether it is Good Enough, and if it is, **stop fiddling with it**. Optimality is hard, but adequate is usually reachable in one or two tries (especially with a bit of experience).
Donal Fellows
+1  A: 

You should divide up your generations into lots of small tasks and put them in a queue. Spawn one thread per core and have each thread grab a task to do, run it to completion, and repeat.

You want lots more tasks than cores to make sure that you don't end up with just one task running at the end of the generation and all other threads idle. This is what is likely to happen if you set #tasks = #threads = #cores as Albin suggests (unless you can ensure that all tasks take precisely the same amount of time).

You also probably don't want more threads than cores. Context switching isn't terribly expensive, but the larger cache footprint that comes with having more than #cores tasks simultaneously active could hurt you (unless your tasks use very little memory).
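A minimal sketch of that queue-of-small-tasks setup (Java assumed; `calculateChunk` is just a stand-in for one small slice of a generation's work):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class GenerationRunner {

    // Placeholder for one small slice of a generation's calculation.
    static void calculateChunk(int chunkIndex) {
        // ... actual work for this chunk ...
    }

    static void runGeneration(int numChunks) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Many more tasks than threads, so the pool stays busy right up to the
        // end of the generation instead of one long task finishing alone.
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int i = 0; i < numChunks; i++) {
            final int chunk = i;
            tasks.add(() -> {
                calculateChunk(chunk);
                return null;
            });
        }

        pool.invokeAll(tasks); // blocks until every chunk is done
        pool.shutdown();
    }
}
```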

Keith Randall
Hi Keith - see the comment on my question and my reply to Albin. Spawning one thread per core is simple (I'm actually doing that right now) but I think it's non-optimal.
SFun28
It is possible that you could use some more threads if portions of your computation are I/O-bound. But you wouldn't need a lot more. Something like #cores + #disk heads would be good.
Keith Randall
+1  A: 

How about an evolutionary algorithm?

Start with a guess. One thread per CPU core seems good, but it depends on the task at hand.

Measure the average time for each task in the generation. Compare it to the time taken by the previous generation. (Assume effectively infinite time and 0 threads for generation 0).

If the most recent generation's tasks averaged a better time than the one before, continue to change the number of threads in the same direction as you did last step (so if the last generation had more threads than the previous generation, add a thread for the new generation, but if it had fewer, use one fewer, obviously with a lower limit of 1 thread).

If the most recent generation's tasks took longer, on average, than the previous generation's, then change the number of threads in the opposite direction (so if increasing the number of threads resulted in a worse time, use one fewer thread next time).

As long as the optimal number of threads isn't too close to 1, then you'll probably end up oscillating between 3 values that are all reasonably close to optimal. You may want to explicitly detect this case and lock yourself into the central value, if you have a large number of generations to deal with.
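A minimal sketch of that rule (Java assumed; the class and method names are purely illustrative, and the 1-16 bound is taken from the range mentioned in the question):

```java
/** Adjusts the thread count between generations by simple hill climbing. */
public class ThreadCountTuner {
    private static final int MIN_THREADS = 1;
    private static final int MAX_THREADS = 16;  // upper bound suggested in the question

    private int currentThreads;
    private int direction = +1;                 // start by adding a thread for generation 2
    private double previousTime = Double.POSITIVE_INFINITY;  // generation 0 "took forever"

    public ThreadCountTuner(int initialThreads) {
        this.currentThreads = initialThreads;   // e.g. one thread per core for generation 1
    }

    public int currentThreadCount() {
        return currentThreads;
    }

    /** Feed in the elapsed time of the generation that just finished; returns the count for the next one. */
    public int nextThreadCount(double lastGenerationTime) {
        if (lastGenerationTime > previousTime) {
            direction = -direction;             // it got worse: reverse direction
        }
        previousTime = lastGenerationTime;
        currentThreads = Math.max(MIN_THREADS, Math.min(MAX_THREADS, currentThreads + direction));
        return currentThreads;
    }
}
```

Generation 1 runs with the initial count; after each generation you pass its elapsed time to `nextThreadCount` and use the returned value for the next one. Because generation 0 is treated as infinitely slow, generation 1 always counts as an improvement, so generation 2 gets one more thread than generation 1.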

Bill Michell
@Bill Michell - that's what I was looking for! Since we're minimizing time, I think we assume infinite time for gen 0. 0 threads for gen 0 isn't strictly necessary because gen 1 will always be better than gen 0 (< infinity). Of course, since you initially set # threads = # cores, the system has to randomly decide whether to increase or decrease by 1 for gen 2 (since there's no notion of "same direction as you did last step" until you have 2 generations complete). Does that sound about right?
SFun28
Yes, you're right. Infinite time for G0 works better. The way I phrased it, the intention was that we should always increase the threads by 1 for G2.
Bill Michell
The algorithm you describe sounds more like greedy hill climbing than evolutionary. In this case it's probably a good solution, but I don't see how this approach is evolutionary.
JohnIdol
@JohnIdol I guess there is only one trial per generation, so maybe evolutionary isn't a good term.
Bill Michell