How to distribute a program on an unreliable cluster?

+2 Q:

What I'm looking for is any/all of the following:

automatic discovery of worker failure (computer off for instance)
detection of all running (Linux) PCs in a given IP address range (computer on)
... and auto worker spawning (ping+ssh?)
load balancing so that workers do not slow down other processes (nice?)
some form of message passing

... and don't want to reinvent the wheel.

C++ library, bash scripts, stand alone program ... all are welcome.

If you suggest a piece of software, please say which of the above features it provides.
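For reference, the "ping+ssh" idea from the list above could be roughed out as below. This is only a sketch under assumptions: it treats a host as "on" if its SSH port (22) accepts a TCP connection, and the names `ssh_port_open`, `candidate_hosts` and `discover` are made up for illustration. It does not spawn workers or detect later failures.

```python
import ipaddress
import socket
from concurrent.futures import ThreadPoolExecutor


def ssh_port_open(host, port=22, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def candidate_hosts(cidr):
    """List the usable host addresses in a CIDR block such as '192.168.1.0/24'."""
    network = ipaddress.ip_network(cidr, strict=False)
    return [str(ip) for ip in network.hosts()]


def discover(cidr, probe=ssh_port_open, max_workers=32):
    """Probe every address in the range in parallel; return those that answer."""
    hosts = candidate_hosts(cidr)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        alive = list(pool.map(probe, hosts))
    return [host for host, up in zip(hosts, alive) if up]
```

Calling `discover("192.168.1.0/24")` would return the addresses that accepted a connection; the `probe` argument can be swapped for an ICMP ping or any other check.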

+3 A:

Check out the Spread Toolkit, a C/C++ group communication system. It will allow you to detect node/process failure and recovery/startup, in a manner that allows you to rebalance a distributed workload.
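To illustrate the rebalancing part only: once something tells you the current set of live members, reassigning work can be quite small. The sketch below does not use Spread's API at all; it assumes you already receive membership lists from somewhere, and it uses rendezvous hashing (my choice, not something Spread prescribes) so that only the tasks owned by a failed member get moved. All function names are hypothetical.

```python
import hashlib


def owner(task, members):
    """Pick the member with the highest hash score for this task.

    Raises ValueError if members is empty.
    """
    def score(member):
        digest = hashlib.sha1(f"{member}|{task}".encode()).hexdigest()
        return int(digest, 16)
    return max(members, key=score)


def assign(tasks, members):
    """Map each task to its owner among the currently live members."""
    return {task: owner(task, members) for task in tasks}


def on_membership_change(tasks, old_members, new_members):
    """Return only the tasks whose owner changed after a join or a failure."""
    before = assign(tasks, old_members)
    after = assign(tasks, new_members)
    return {task: after[task] for task in tasks if before[task] != after[task]}
```

When a member drops out, `on_membership_change` returns just the tasks that member owned, each mapped to its new owner; tasks on surviving members stay where they are.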

Nick Gunn 2009-05-30 09:26:29

Nice, but the last release was in 2006.

Łukasz Lew 2009-05-30 09:54:10

It is still actively maintained. It's just pretty stable.

Nick Gunn 2009-05-30 09:58:34

Oops, forgot to add, you can get patches via the dev branch and I hear a 4.0.1 version might not be very far away.

Nick Gunn 2009-05-30 09:59:45

That's good news :)

Łukasz Lew 2009-05-30 09:59:59

+1 A:

Depending on your application requirements, I would check out the BOINC infrastructure. They're implementing a form of client/server communication in their latest releases, and it's not clear what form of communication you need. Their API is in C, and we've written wrappers for it in C++ very easily.

The other advantage of BOINC is that it was designed to scale for large distributed computing projects like SETI or Rosetta@Home, so it supports things like validation, job distribution, and management of different application versions for different platforms.

Here's the link:

BOINC website

James Thompson 2009-05-30 09:43:53

Is it efficient and easy to deploy on a local area network?

Łukasz Lew 2009-05-30 10:16:23

+1 A:

There is Hadoop. It has MapReduce, but I'm not sure whether it has any of the other features I need. Does anybody know?

Łukasz Lew 2009-05-30 10:15:10

Hadoop does do this.

monksy 2010-01-21 21:03:00

+1 A:

What you are looking for is called a "job scheduler". There are many job schedulers on the market; these are the ones I'm familiar with:

SGE handles any and all issues related to job scheduling on multiple machines (recovery, monitoring, priority, queuing). Your software does not have to be SGE-aware, since SGE simply provides an environment in which you submit batch jobs.
LSF is a better alternative, but not free.

To support message passing, see the MPI specification. SGE fully supports MPI-based distribution.

ASk 2009-05-30 14:32:39

You are indeed looking for a "job scheduler." Nodes are "statically" registered with a job scheduler. This allows the job scheduler to inspect the nodes and determine the core count, RAM, available scratch disk space, OS, and much more. All of that information can be used to select the required resources for a job.
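As a toy picture of what "select the required resources" means, assuming a hand-written registry rather than any real scheduler's data model (the `Node` fields and the `eligible` function are invented for this example):

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Static facts a scheduler might record when a node is registered."""
    name: str
    cores: int
    ram_gb: int
    scratch_gb: int
    os: str


def eligible(nodes, cores=1, ram_gb=0, scratch_gb=0, os=None):
    """Return the names of nodes that satisfy every stated requirement."""
    return [
        node.name
        for node in nodes
        if node.cores >= cores
        and node.ram_gb >= ram_gb
        and node.scratch_gb >= scratch_gb
        and (os is None or node.os == os)
    ]
```

A real scheduler layers queues, priorities and current load on top of this kind of filter; the sketch only covers matching a job's requirements against registered hardware.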

Job schedulers also provide basic health monitoring of the cluster. Nodes that are down are automatically removed from the list of available nodes. Nodes which are running jobs (through the scheduler) are also removed from the list of available nodes.
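The bookkeeping behind that is roughly the following. This is a minimal sketch, not how any particular scheduler implements it: it assumes each node sends a periodic heartbeat, treats a node as down once no heartbeat has arrived within `timeout` seconds, and the `NodePool` class is a made-up name.

```python
import time


class NodePool:
    """Track which nodes are alive (recent heartbeat) and not running a job."""

    def __init__(self, timeout=10.0, clock=time.monotonic):
        self.timeout = timeout   # seconds without a heartbeat before a node counts as down
        self.clock = clock       # injectable so the logic can be tested without sleeping
        self.last_seen = {}      # node name -> time of last heartbeat
        self.busy = set()        # nodes currently running a job

    def heartbeat(self, node):
        """Record that the node reported in just now."""
        self.last_seen[node] = self.clock()

    def start_job(self, node):
        """Mark the node as occupied by a job."""
        self.busy.add(node)

    def finish_job(self, node):
        """Mark the node as free again."""
        self.busy.discard(node)

    def available(self):
        """Return nodes that are alive and idle, sorted by name."""
        now = self.clock()
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen <= self.timeout and node not in self.busy
        )
```

A node that is switched off simply stops sending heartbeats and drops out of `available()` after the timeout; it reappears as soon as it reports in again.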

SLURM is a resource manager & job scheduler that you might consider. SLURM has integration hooks for LSF and PBSPro. Several MPI implementations are "SLURM aware" and can use/set environment variables that will allow an MPI job to run on the nodes allocated to it by SLURM.
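For a program that is not MPI-based, the same information can be read directly from the environment. The sketch below relies on `SLURM_JOB_ID`, `SLURM_JOB_NODELIST`, `SLURM_NTASKS` and `SLURM_PROCID`, which to my knowledge SLURM sets for jobs and tasks it launches; check the documentation for your version. The `slurm_context` helper and its defaults are my own invention.

```python
import os


def slurm_context(env=None):
    """Return basic job information from SLURM's environment, or None outside a job."""
    env = os.environ if env is None else env
    if "SLURM_JOB_ID" not in env:
        return None
    return {
        "job_id": env["SLURM_JOB_ID"],
        # Compressed form such as "node[01-04]"; it is not expanded here.
        "nodelist": env.get("SLURM_JOB_NODELIST", ""),
        # Assumed defaults for a single-task job when the variables are absent.
        "ntasks": int(env.get("SLURM_NTASKS", "1")),
        "rank": int(env.get("SLURM_PROCID", "0")),
    }
```

Each task can then use `rank` and `ntasks` to pick its share of the work, for example every `ntasks`-th item starting at `rank`.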

semiuseless 2009-06-23 19:36:21
