views: 37

answers: 3

Dear Overflowers:

At work we perform demanding numerical computations.

We have a network of several Linux boxes with different processing capabilities. At any given time, there can be anywhere from zero to dozens of people connected to a given box.

I created a script to measure MFLOPS (Millions of Floating-Point Operations per Second) using the Linpack benchmark; it also reports the number of cores and the amount of memory.

I would like to use this information, together with the load average (obtained from the uptime command), to suggest the best computer for performing a demanding computation. In other words: it's 3:00 pm; I have a meeting in two hours; I need to run a demanding process: which node will get me the answer fastest?

I envision a script which will output a suggestion along the lines of:

SUGGESTED HOSTS (IN ORDER OF PREFERENCE)
HOST1.MYNETWORK
HOST2.MYNETWORK
HOST3.MYNETWORK

Such a suggestion should favor fast computers (high MFLOPS) when the load average is low and, as the load average increases for a given node, it should favor available nodes instead (i.e., I'd rather run on a slower computer with no users than on an eight-core with forty dudes logged in).

How should I prioritize? What algorithm (rationale) would you use? Again, what I have is (a rough sketch of how I collect these follows the list):

  1. Load average (1 min, 5 min, 15 min)
  2. MFLOPS measurement
  3. Number of users logged in
  4. RAM (installed and available)
  5. Number of cores (important for normalizing the load average)
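
For reference, here is a rough sketch (Python) of how I gather these numbers on each node. The parsing assumes typical GNU/Linux output from nproc, free, who, and hostname, and the MFLOPS value is whatever my Linpack script reported, hard-coded here for illustration:

    #!/usr/bin/env python3
    # Sketch only: collect the per-node numbers listed above.
    # Assumes typical GNU/Linux output from nproc, free, who, hostname;
    # older versions of `free` lack the "available" column, so adjust as needed.
    import os
    import subprocess

    def run(cmd):
        return subprocess.check_output(cmd, shell=True).decode().strip()

    def node_stats(mflops):
        load1, load5, load15 = os.getloadavg()        # 1-, 5-, 15-minute load averages
        cores = int(run("nproc"))                     # logical processors
        users = len({line.split()[0] for line in run("who").splitlines()})
        mem = run("free -m").splitlines()[1].split()  # "Mem:" row, in MiB
        return {
            "host": run("hostname"),
            "mflops": mflops,                         # from the Linpack script
            "load": (load1, load5, load15),
            "cores": cores,
            "users": users,
            "ram_mb": (int(mem[1]), int(mem[-1])),    # installed, available
        }

    if __name__ == "__main__":
        print(node_stats(mflops=4200.0))              # hypothetical Linpack result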

Any thoughts? Thanks!

A: 

Have you considered a distributed approach to the computation? Not all computations can be broken up so that more than one CPU can work on them, but perhaps your problem space can benefit from some parallelization. Have a look at Hadoop.

Asaph
Unfortunately, at this point our software is not a candidate for distributed computing. However, I have looked into Hadoop, and I am pretty sure that Map/Reduce will become a hot topic in our area in the near future - I may even use it for some file processing we've been working on. Thanks!
Arrieta
+1  A: 

You don't have enough data to make a well-informed decision. It sounds as though the scheduling is very volatile: "At any given time, there can be anywhere from zero to dozens of people connected to a given box." So the current load does not necessarily reflect the future load of the machines.

To properly assess which hosts someone should use to minimize computation time, you would need to know when the current jobs will terminate. If a powerful machine is about to finish most of its jobs, it would be a good candidate even though it currently has a high load.

If you want to guess purely from the current situation, you can do a weighted calculation to find out which hosts have the most MFLOPS available.

available MFLOPS = host's MFLOPS * (number of logical processors - load average) / number of logical processors

Sort the hosts by available MFLOPS and suggest them in descending order.

This formula assumes that the MFLOPS of a host is linearly related to its load average. This might not be exactly true, but it's probably fairly close.

I would favor the most recent load average, since it's closer to the current/future situation, whereas jobs from 15 minutes ago might have completed by now.
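
A minimal sketch of that ranking (Python), assuming the per-host numbers have already been collected into dicts; the field names and figures are illustrative:

    # Rank hosts by estimated available MFLOPS (linear-scaling assumption).
    def available_mflops(h):
        free_cores = max(0.0, h["cores"] - h["load1"])  # clamp if overloaded
        return h["mflops"] * free_cores / h["cores"]

    def suggest(hosts):
        print("SUGGESTED HOSTS (IN ORDER OF PREFERENCE)")
        for h in sorted(hosts, key=available_mflops, reverse=True):
            print(h["host"])

    # Made-up numbers: the idle four-core beats the busy eight-core.
    suggest([
        {"host": "HOST1.MYNETWORK", "mflops": 9000.0, "cores": 8, "load1": 7.5},
        {"host": "HOST2.MYNETWORK", "mflops": 2500.0, "cores": 4, "load1": 0.2},
    ])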

Ben S
I like your point, and I hope that by starting with this simple script I will reduce the volatility of the node scheduling - if more people use the script, it will help balance the load. As for the task duration, you are completely right, and I will find a way to factor that in (we do have an estimate for some tasks). I like your approach on "available MFLOPS". Thanks!
Arrieta
You could calculate the approximate MFLO (millions of floating-point operations) of a calculation by running it by itself on a host with a known MFLOPS rating and multiplying the number of seconds the calculation ran by the MFLOPS of the host. Once you have that, you can estimate how long a calculation will take, given the available MFLOPS of a host. For bonus points, make a simple web app to manage the scheduling so that users can schedule calculations of varying priority.
Ben S
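
(To make that estimate concrete, a small worked example; both MFLOPS figures here are purely hypothetical:)

    # Calibrate the job once on a known host, then predict its runtime elsewhere.
    calibration_mflops = 2000.0     # known MFLOPS of the calibration host
    observed_seconds   = 600.0      # the job ran 10 minutes there by itself
    job_mflo = calibration_mflops * observed_seconds   # ~1.2e6 MFLO of work

    target_available_mflops = 5000.0                    # from the formula above
    estimated_seconds = job_mflo / target_available_mflops   # ~240 s
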
A: 

You don't need to know the FLOPS. Beowulf handles this; the parallel computing center I go to has the script, for sure.

PDC operates leading-edge, high-performance computers on a national level. PDC offers easily accessible computational resources that primarily cater to the ...

LarsOn
I'm sorry, I don't understand your answer.
Arrieta
Link is beowulf.org
LarsOn