views:

129

answers:

3

Hello there, so here is my problem:

I have several servers of different configurations. I have different calculations (jobs), and I can predict approximately how long each job will take to calculate. I also have priorities. My question is how to keep all machines loaded at 99-100% and schedule the jobs in the best way.

Each machine can run several calculations at a time. Jobs are pushed to the machines, and the central machine knows the current load of each machine. I would also like to apply some kind of machine learning here, because I will have statistics for each job (started, finished, CPU load, etc.).

How can I distribute jobs (calculations) in the best possible way, keeping in mind the priorities?

Any suggestions, ideas, or algorithms?

FYI: my platform is .NET.

+1  A: 

Looks like this has very little to do with .NET.

But think of your machines as 'worker threads': make a 'pool' of available machines ordered by available CPU (or another resource that matters to you), then use your knowledge of each task to push each job to the best-fitted machine.
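The pool idea above can be sketched as a priority-ordered dispatcher. This is a minimal illustration, not a .NET implementation: the job tuples, machine IDs, and capacity units are all made up, and a real system would update capacities from live load reports.

```python
import heapq

def dispatch(jobs, machines):
    """Assign jobs (highest priority first) to the machine with the most
    free CPU that can hold them.

    jobs: list of (priority, cpu_needed, job_id) tuples.
    machines: dict of machine_id -> free CPU capacity.
    Returns a dict of job_id -> machine_id (unplaceable jobs are skipped).
    """
    # Max-heap of machines keyed on free capacity (negated for heapq's min-heap).
    pool = [(-free, mid) for mid, free in machines.items()]
    heapq.heapify(pool)
    assignment = {}
    for priority, cpu, job_id in sorted(jobs, reverse=True):
        neg_free, mid = heapq.heappop(pool)
        free = -neg_free
        if cpu <= free:           # machine with the most headroom gets the job
            assignment[job_id] = mid
            free -= cpu
        heapq.heappush(pool, (-free, mid))
    return assignment
```

The heap keeps "pick the least-loaded machine" at O(log n) per job, which matters once the pool grows past a handful of servers.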

If you know all the jobs upfront, you could probably use a 'best fit' algorithm to schedule them in the correct order on the correct machines. You could also look at 'cutting stock' algorithms: http://en.wikipedia.org/wiki/Cutting_stock_problem
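For the know-all-jobs-upfront case, a classic heuristic in this family is first-fit decreasing. The sketch below is illustrative only (equal-capacity machines, durations as the packed quantity); the cutting-stock formulation linked above handles the general case.

```python
def first_fit_decreasing(durations, capacity):
    """Pack job durations onto the fewest machines of equal capacity.

    Sorts jobs longest-first, places each into the first machine with room,
    and opens a new machine when none fits. Returns the machine count used.
    """
    bins = []  # remaining capacity per machine
    for d in sorted(durations, reverse=True):
        for i, free in enumerate(bins):
            if d <= free:
                bins[i] -= d   # fits in an existing machine
                break
        else:
            bins.append(capacity - d)  # open a new machine
    return len(bins)
```

First-fit decreasing is not optimal in general, but it is guaranteed to use at most roughly 11/9 of the optimal number of bins, which is usually good enough for scheduling.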

Jørn Jensen
+3  A: 
  1. Look at DryadLINQ. It is already available as an academic release and may be useful.
  2. Windows HPC Server, Microsoft's enterprise solution for distributed computing.
  3. Code samples that can help you build load balancing by analyzing performance counters.
  4. Microsoft's StockTrader sample application (with sources), an example of a distributed SOA with hand-written round-robin load balancing.
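The hand-written round-robin approach mentioned in item 4 amounts to rotating requests across endpoints. Here is a language-agnostic sketch (in Python, with invented endpoint names), not the actual StockTrader code:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Rotate requests across a fixed list of service endpoints."""

    def __init__(self, endpoints):
        self._cycle = cycle(endpoints)  # endless iterator over the list

    def next_endpoint(self):
        """Return the next endpoint in rotation."""
        return next(self._cycle)
```

Round-robin ignores load and job size entirely, which is why it is usually only a baseline; the asker's known job durations argue for a load-aware policy instead.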
Yauheni Sivukha
+1  A: 

Microsoft recently published a paper on their Quincy scheduler. If you are simply optimizing for CPU utilization, then a very simple solver can find the global optimum. If you need to optimize across more axes, then obviously the problem space becomes more complicated.
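One example of such a simple solver for the pure-CPU case is the longest-processing-time (LPT) greedy: sort jobs longest first and give each to the currently least-loaded machine. This is a sketch with illustrative numbers, not Quincy itself (which solves a min-cost flow problem):

```python
import heapq

def lpt_makespan(durations, n_machines):
    """Balance job durations across machines, longest job first, each to
    the currently least-loaded machine. Returns the resulting makespan
    (finish time of the busiest machine)."""
    loads = [(0, m) for m in range(n_machines)]  # min-heap keyed on load
    heapq.heapify(loads)
    for d in sorted(durations, reverse=True):
        load, m = heapq.heappop(loads)       # least-loaded machine
        heapq.heappush(loads, (load + d, m))
    return max(load for load, _ in loads)
```

LPT is provably within a factor of 4/3 of the optimal makespan, so for the "just keep CPUs busy" objective it is often all that is needed.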

How big is your cluster? How do you deal with optimizing around failure cases? Do they matter? Is there I/O? Does data have disk affinity? Is there more than one place to run a piece of a job? These are all things to consider.

Steve