What's a good method for assigning work to a set of remote machines? Consider an example where the task is very CPU- and RAM-intensive but doesn't actually process a large dataset. The language of choice would be Java. I was thinking Hadoop would be a good option, but the dataset passed between remote machines is fairly small, and Hadoop seems to focus mainly on the distribution of data rather than the distribution of work.

What are some good technologies that can help?

EDIT: I'm mainly interested in load balancing. There will be a series of jobs with a small (< 3MB) dataset, but significant processing and memory needs.

+3  A: 

MPI would probably be a good choice; there's even a Java implementation.
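
A minimal sketch in the mpiJava/MPJ Express style, where each process just picks jobs by its rank, looks roughly like this (method names vary between Java MPI implementations, so treat the calls as an approximation):

    import mpi.MPI;

    // Rank-based work split: every process handles the jobs whose index
    // matches its rank modulo the number of processes. Any smarter load
    // balancing is up to you.
    public class MpiWorkSplit {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();   // id of this process
            int size = MPI.COMM_WORLD.Size();   // total number of processes

            int totalJobs = 100;                // hypothetical job count
            for (int job = rank; job < totalJobs; job += size) {
                // The CPU/RAM-heavy processing for one job goes here.
                System.out.println("rank " + rank + " handling job " + job);
            }

            MPI.Finalize();
        }
    }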

tvanfosson
Voted up for teaching me something :)
karim79
Does it do any sort of load balancing?
User1
MPI is an API that allows you to design and implement parallel algorithms. It's not a magic bullet that you hand your problem to and have it automatically run in parallel. You have to decide how the work is divided up. It just gives you the ability to operate in parallel and communicate information among the cooperating processors/computers.
tvanfosson
+1  A: 

MPI may be part of your answer, but looking at the question, I'm not sure if it addresses the portion of the problem you care about.

MPI provides a communication layer between processing components. It is low-level, requiring you to do a fair amount of work, but from what I saw in an introductory presentation, it also comes with some common matrix-manipulation functions.

In your question, you seem to be more interested in the load-balancing/job-processing aspects of the problem. If that really is your focus, maybe a small program hosted in a Servlet or an RMI server would be sufficient. Let each worker go to the server for its next unit of work and then submit the results back (you might even be able to use a database/file share, but pay attention to locking issues). In other words, a pull mechanism versus a push mechanism.
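
To make the pull idea concrete, here is a rough sketch of what the worker side could look like over RMI. The names (JobServer, WorkUnit, nextUnit, submitResult) are made up for illustration, and the server side (extending UnicastRemoteObject and binding the service) is left out:

    import java.io.Serializable;
    import java.rmi.Naming;
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // A small, serializable job description passed over RMI (the < 3MB dataset).
    class WorkUnit implements Serializable {
        long id;
        byte[] data;
    }

    // Remote interface the server exposes; workers pull work from it.
    interface JobServer extends Remote {
        WorkUnit nextUnit() throws RemoteException;   // null when the queue is empty
        void submitResult(long jobId, byte[] result) throws RemoteException;
    }

    // Worker: repeatedly pull a unit, do the heavy computation, send the result back.
    public class Worker {
        public static void main(String[] args) throws Exception {
            JobServer server = (JobServer) Naming.lookup("rmi://server-host/JobServer");
            WorkUnit unit;
            while ((unit = server.nextUnit()) != null) {
                byte[] result = compute(unit.data);   // the CPU/RAM-intensive part
                server.submitResult(unit.id, result);
            }
        }

        private static byte[] compute(byte[] input) {
            return input;                             // placeholder for real processing
        }
    }

Because a worker only takes a job when it has finished the previous one, the faster machines naturally end up doing more of the work, which gives you the load balancing you're after.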

This approach is fairly simple to implement and gives you the advantage of scaling up just by running more distributed clients. Load balancing isn't too important if you intend to allow your process to take full control of the machine. You can experiment with running multiple clients on a machine with multiple cores to see if you can improve the overall throughput of the node. A multi-threaded client would be more efficient, but can increase complexity depending on the structure of the code you are using to solve the problem.
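
If you go the multi-threaded route, something like a fixed pool sized to the core count keeps all cores busy from a single pull loop. Again just a sketch, reusing the hypothetical JobServer/WorkUnit types from above:

    import java.rmi.Naming;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Multi-threaded worker: one pull loop feeding a pool sized to the core count.
    public class ThreadedWorker {
        public static void main(String[] args) throws Exception {
            JobServer server = (JobServer) Naming.lookup("rmi://server-host/JobServer");
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            WorkUnit unit;
            while ((unit = server.nextUnit()) != null) {
                final WorkUnit u = unit;
                pool.submit(() -> {
                    try {
                        server.submitResult(u.id, compute(u.data));   // heavy work on a pool thread
                    } catch (Exception e) {
                        e.printStackTrace();                          // real code would retry or requeue
                    }
                });
            }
            pool.shutdown();                                          // no new tasks accepted
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS); // let queued jobs finish
        }

        private static byte[] compute(byte[] input) {
            return input;                                             // placeholder for real processing
        }
    }

Since each job is memory-heavy, you'd probably also want to bound how many units are pulled ahead of the pool rather than queuing them eagerly, but that's a detail.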

Jim Rush
Someone else on my team mentioned RMI as well. I like the "pull" idea! I think I'll go that route. Thanks!
User1