What I'm looking for is any/all of the following:

  • automatic discovery of worker failure (e.g. a computer switched off)
  • detection of all running Linux PCs in a given IP address range (computer on)
  • ... and auto worker spawning (ping+ssh?)
  • load balancing so that workers do not slow down other processes (nice?)
  • some form of message passing

... and don't want to reinvent the wheel.

C++ library, bash scripts, stand-alone program ... all are welcome.

If you give an example of software, please tell us which of the above functions it provides.

+3  A: 

Check out the Spread Toolkit, a C/C++ group communication system. It will allow you to detect node/process failure and recovery/startup, in a way that lets you rebalance a distributed workload.
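To give a feel for how this surfaces in code, below is a rough sketch of a worker that joins a group and watches membership messages, which is how Spread reports joins, leaves and crashed hosts. It is written from memory of the Spread C API (sp.h); the daemon address, group name and buffer sizes are placeholders, and the exact signatures should be checked against the headers of your Spread version.

    #include <sp.h>
    #include <cstdio>

    int main() {
        mailbox mbox;
        char private_group[MAX_GROUP_NAME];

        // Connect to a local Spread daemon; the final "1" asks for membership
        // messages, which is how joins/leaves/host failures are reported.
        int ret = SP_connect("4803@localhost", "worker1", 0, 1, &mbox, private_group);
        if (ret != ACCEPT_SESSION) { SP_error(ret); return 1; }

        SP_join(mbox, "workers");   // every worker joins the same group

        for (;;) {
            service svc = 0;
            char sender[MAX_GROUP_NAME], groups[32][MAX_GROUP_NAME], mess[1024];
            int num_groups = 0, endian = 0;
            int16 mess_type = 0;

            int len = SP_receive(mbox, &svc, sender, 32, &num_groups, groups,
                                 &mess_type, &endian, sizeof(mess), mess);
            if (len < 0) { SP_error(len); break; }

            if (Is_membership_mess(svc)) {
                // Membership changed: a worker joined, left, or its host died.
                // 'num_groups' holds the new member count -- rebalance here.
                std::printf("membership change in %s: %d members\n", sender, num_groups);
            } else if (Is_regular_mess(svc)) {
                std::printf("data from %s: %.*s\n", sender, len, mess);
            }
        }
        SP_disconnect(mbox);
    }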

Nick Gunn
Nice, but the last release was in 2006.
Łukasz Lew
It is still actively maintained. It's just pretty stable.
Nick Gunn
Oops, forgot to add: you can get patches via the dev branch, and I hear a 4.0.1 release might not be very far away.
Nick Gunn
That's good news :)
Łukasz Lew
+1  A: 

Depending on your application requirements, I would check out the BOINC infrastructure. It's not clear what form of communication you need, but they're implementing a form of client/server communication in their latest releases. Their API is in C, and we've written C++ wrappers for it very easily.

The other advantage of BOINC is that it was designed to scale to large distributed computing projects like SETI@home or Rosetta@home, so it supports things like validation, job distribution, and management of different application versions for different platforms.
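For a sense of scale, the per-job client side is small. The sketch below is a minimal worker built against the standard BOINC C API (boinc_api.h); the loop body is a placeholder for the real computation, and a complete BOINC application would also handle input/output file resolution and checkpointing.

    #include "boinc_api.h"

    int main() {
        boinc_init();                               // attach to the BOINC client

        const int steps = 1000;
        for (int i = 0; i < steps; ++i) {
            // ... one slice of the real computation goes here ...
            boinc_fraction_done(double(i) / steps); // report progress to the client
        }

        boinc_finish(0);                            // 0 = success; does not return
    }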

Here's the link:

BOINC website

James Thompson
Is it efficient/easy to deploy on a local area network?
Łukasz Lew
+1  A: 

There is Hadoop. It has MapReduce, but I'm not sure whether it has any of the other features I need. Does anybody know?

Łukasz Lew
Hadoop does do this.
monksy
+1  A: 

What you are looking for is called a "job scheduler". There are many job schedulers on the market; these are the ones I'm familiar with:

  • SGE (Sun Grid Engine) handles any and all issues related to job scheduling on multiple machines (recovery, monitoring, priority, queuing). Your software does not have to be SGE-aware, since SGE simply provides an environment in which you submit batch jobs.
  • LSF is a better alternative, but not free.

To support message passing, see the MPI specification. SGE fully supports MPI-based distribution.
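To give a feel for the message-passing side, here is a minimal sketch using only standard MPI calls; under SGE it would be launched through whatever MPI parallel environment your site has configured, but nothing in the code itself is scheduler-specific.

    // Compile with an MPI compiler wrapper, e.g.: mpicxx workers.cpp -o workers
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's id
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes

        if (rank == 0) {
            // Rank 0 hands one work item to every other rank.
            for (int r = 1; r < size; ++r) {
                int work_item = r * 10;         // placeholder payload
                MPI_Send(&work_item, 1, MPI_INT, r, 0, MPI_COMM_WORLD);
            }
        } else {
            int work_item = 0;
            MPI_Recv(&work_item, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank %d of %d received %d\n", rank, size, work_item);
        }

        MPI_Finalize();
        return 0;
    }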

ASk
A: 

You are indeed looking for a "job scheduler." Nodes are "statically" registered with a job scheduler. This allows the job scheduler to inspect the nodes and determine the core count, RAM, available scratch disk space, OS, and much more. All of that information can be used to select the required resources for a job.

Job schedulers also provide basic health monitoring of the cluster. Nodes that are down are automatically removed from the list of available nodes, as are nodes that are already running jobs (through the scheduler).

SLURM is a resource manager and job scheduler that you might consider. It has integration hooks for LSF and PBSPro. Several MPI implementations are "SLURM-aware" and can use/set environment variables that allow an MPI job to run on the nodes SLURM allocated to it.
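As an illustration of what "SLURM-aware" means in practice: anything launched inside a SLURM allocation (e.g. via srun) can read its placement from environment variables such as SLURM_JOB_ID, SLURM_NTASKS and SLURM_PROCID. A minimal sketch:

    #include <cstdio>
    #include <cstdlib>

    // Return the value of an environment variable, or a fallback if unset.
    static const char* env_or(const char* name, const char* fallback) {
        const char* v = std::getenv(name);
        return v ? v : fallback;
    }

    int main() {
        std::printf("job id    : %s\n", env_or("SLURM_JOB_ID", "(not under SLURM)"));
        std::printf("tasks     : %s\n", env_or("SLURM_NTASKS", "?"));
        std::printf("this task : %s\n", env_or("SLURM_PROCID", "?"));
        std::printf("node list : %s\n", env_or("SLURM_JOB_NODELIST", "?"));
        return 0;
    }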

semiuseless