views: 48 · answers: 1

I have a master/worker model implemented with separate Python processes. The master process holds job/result lists which are protected by mutexes. Many workers run on many machines (about 200 worker processes in total).

I have noticed that on each machine the workers tend to do 0-20% more or less work than the other worker processes, and that the machines do 0-20% more or less work than the other machines. The fastest/slowest workers and machines are different every day.

Is this a conceptual problem of the master/worker model, does it hint at a problematic implementation, or is everything fine?

+2  A: 

The simplest explanation for the +/- 20% thing is that you're seeing a load balancing problem; some of the workers are just getting 20% more work than some of their peers. This could represent an implementation problem, or it could just be discreteness; if you have 200 worker processes but 1040 roughly-equal jobs to do, then 1/5 of the worker processes are going to have an extra 20% of work to do, and there's nothing to be done about that unless you can subdivide the work more finely.
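The discreteness argument above can be checked with a few lines of arithmetic (the 1040-job figure is the answer's illustrative example, not anything from the question):

```python
# 1040 roughly-equal jobs cannot divide evenly over 200 workers.
jobs, workers = 1040, 200

base = jobs // workers   # every worker gets at least this many jobs
extra = jobs % workers   # this many workers get one extra job
print(base, extra)       # 5 40

# Those 40 workers (1/5 of the pool) do 6 jobs instead of 5,
# i.e. 20% more work than their peers:
imbalance = (base + 1) / base - 1
print(f"{imbalance:.0%}")  # 20%
```

No scheduler can fix this kind of imbalance; only subdividing the jobs more finely can.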

Master/worker scales (and handles these load balancing issues about as well and easily as anything else) up to the point where contention for the shared resources in the master process starts to become non-trivial. You can push scaling forward a little bit by reducing the critical sections (those protected by mutexes) to an absolute minimum; by aggregating work units so that there are fewer requests (but notice that this works in the opposite direction of improving load balancing); or by having multiple masters (potentially a hierarchy of masters).

If that doesn't work, you have to start considering more peer-to-peer work scheduling algorithms, where there is no longer a single bottleneck. A peer-to-peer analogue of master/worker is called work stealing, which is one of those things that (IMHO) doesn't seem like it should work until someone shows you that it does; it has recently been popularized by Cilk. The idea is that everyone gets a list of tasks, and if the peers need more work they steal it from each other randomly and continue chugging away until they're done. It's more complicated to implement than master/worker, but it avoids the single-master bottleneck.
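To make the work-stealing idea concrete, here is a minimal sketch. Threads stand in for processes, the task (squaring numbers), the queue sizes, and the steal-from-a-random-peer policy are all illustrative choices of mine, not Cilk's actual scheduler:

```python
# Minimal work-stealing sketch: each worker owns a deque of tasks,
# pops from its own end, and steals from a random peer when empty.
import random
import threading
from collections import deque

NUM_WORKERS = 4
results = []
results_lock = threading.Lock()

# Each worker starts with its own share of 100 tasks.
# deque.pop()/popleft() are individually atomic in CPython.
queues = [deque(range(i * 25, (i + 1) * 25)) for i in range(NUM_WORKERS)]

def worker(my_id):
    while True:
        try:
            task = queues[my_id].pop()  # take from own queue first
        except IndexError:
            # Own queue empty: pick a random non-empty peer to steal from.
            victims = [i for i in range(NUM_WORKERS) if i != my_id and queues[i]]
            if not victims:
                return  # nothing left anywhere; this worker is done
            try:
                # Steal from the *other* end of the victim's deque.
                task = queues[random.choice(victims)].popleft()
            except IndexError:
                continue  # lost the race to another thief; try again
        with results_lock:
            results.append(task * task)  # the "work": square the number

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 100 -- every task completed exactly once
```

Note there is no master at all: each task is claimed exactly once because the deque operations are atomic, and idle workers find work on their own instead of contending for a single shared queue.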

Jonathan Dursi