Dear All, I have a situation where I need to distribute work across multiple Java processes running in different JVMs, probably on different machines.

Let's say I have a table with records 1 to 1000. I want the work to be collected and distributed in sets of 10: say, records 1-10 to workerOne, then records 11-20 to workerThree, and so on. Needless to say, workerOne never does the work of workerTwo unless workerTwo couldn't do it.
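
To make the sets-of-10 idea concrete, here is a minimal sketch of the chunking step only. The names `ChunkPlanner`, `Chunk` and `plan` are made up for illustration and do not come from any framework; how chunks reach the workers is a separate concern.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits an inclusive range of record IDs into fixed-size chunks.
class ChunkPlanner {

    // One unit of work: the inclusive ID range [first, last].
    static final class Chunk {
        final int first;
        final int last;

        Chunk(int first, int last) {
            this.first = first;
            this.last = last;
        }

        @Override
        public String toString() {
            return first + "-" + last;
        }
    }

    static List<Chunk> plan(int firstId, int lastId, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Chunk> chunks = new ArrayList<Chunk>();
        for (int start = firstId; start <= lastId; start += chunkSize) {
            // The last chunk may be shorter if the range is not a multiple of chunkSize.
            int end = Math.min(start + chunkSize - 1, lastId);
            chunks.add(new Chunk(start, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = plan(1, 1000, 10);
        // prints "100 chunks, first 1-10, last 991-1000"
        System.out.println(chunks.size() + " chunks, first " + chunks.get(0)
                + ", last " + chunks.get(chunks.size() - 1));
    }
}
```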

This example is purely database-based, but I believe it could be extended to any system, be it file processing, email processing and so forth.

I have a feeling that the immediate response would be to go for a Master/Worker approach. However, here we are talking about different JVMs: even if one JVM were to come down, the others should just keep doing their work.

Now the million dollar question: are there any good, production-ready frameworks that would let me do this? Even concrete implementations for specific needs, such as database records, file processing, email processing and the like, would do.

I have seen the Java Parallel Execution Framework, but I am not sure whether it can be used across different JVMs, or whether the others would keep going if one were to come down. I believe workers could be on multiple JVMs, but what about the master?

More Info 1: Hadoop would be a problem because of its JDK 1.6 requirement. That's a bit too much.

Thanks, Franklin

+2  A: 

Might want to look into MapReduce and Hadoop

Eric Petroelje
+1  A: 

Check out Hadoop

AgileJon
A: 

If you work on records in a single database, consider performing the work within the database itself using stored procedures. The gain from processing the records on different machines might be negated by the cost of retrieving and transmitting the work between the database and the computing nodes.

For file processing it could be a similar case. Working on files in a (shared) filesystem might put heavy I/O pressure on the OS.

And the cost of maintaining multiple JVMs on multiple machines might be overkill too.

As for the question: I once used JADE (Java Agent DEvelopment Framework) for a distributed simulation. Its multi-machine support and message-passing nature might help you.

kd304
+1  A: 

You could also use message queues. Have one process that generates the list of work and packages it in nice little chunks. It then plops those chunks on a queue. Each one of the workers just keeps waiting on the queue for something to show up. When it does, the worker pulls a chunk off the queue and processes it. If one process goes down, some other process will pick up the slack. It's simple, and people have been doing it that way for a long time, so there's a lot of information about it on the net.
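
The shape of this can be sketched inside a single JVM. This is only an illustration under a big assumption: a `java.util.concurrent.BlockingQueue` stands in for the real message broker, and threads stand in for the worker processes. The class name `QueueWorkersSketch` and the `STOP` sentinel are made up for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// In-process stand-in for "producer puts chunks on a queue, workers pull them off".
class QueueWorkersSketch {

    // Sentinel message telling a worker to shut down (compared by identity).
    static final int[] STOP = new int[0];

    // Returns the number of records the workers processed.
    static int run(int workerCount, int firstId, int lastId, int chunkSize) {
        final BlockingQueue<int[]> queue = new LinkedBlockingQueue<int[]>();
        final AtomicInteger processed = new AtomicInteger();

        Thread[] workers = new Thread[workerCount];
        for (int i = 0; i < workerCount; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            int[] chunk = queue.take(); // blocks until a chunk shows up
                            if (chunk == STOP) {
                                return;
                            }
                            // Placeholder for the real work on records chunk[0]..chunk[1].
                            processed.addAndGet(chunk[1] - chunk[0] + 1);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            workers[i].start();
        }

        try {
            // Producer: package the work into chunks and put them on the queue.
            for (int start = firstId; start <= lastId; start += chunkSize) {
                queue.put(new int[] { start, Math.min(start + chunkSize - 1, lastId) });
            }
            // One stop message per worker, queued behind all the real chunks.
            for (int i = 0; i < workerCount; i++) {
                queue.put(STOP);
            }
            for (int i = 0; i < workerCount; i++) {
                workers[i].join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while distributing work", e);
        }
        return processed.get();
    }

    public static void main(String[] args) {
        // prints "records processed: 1000"
        System.out.println("records processed: " + run(3, 1, 1000, 10));
    }
}
```

One thing the in-process version cannot show: with a real broker, a chunk taken by a worker that then crashes is only picked up by someone else if the message was not yet acknowledged. In JMS that usually means a transacted session or client acknowledgement, with the acknowledgement sent after the chunk has been processed.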

mamboking
+1 for the JMS solution
kd304
+1  A: 

I believe Terracotta can do this. If you are dealing with web pages, JBoss can be clustered.

If you want to do this yourself, you will need a work manager that keeps track of jobs to do, jobs in progress, and jobs that were never finished and need to be rescheduled. The workers then ask for something to do, do it, send the result back, and ask for more.
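
A minimal sketch of such a work manager, assuming jobs are plain integer IDs and that "never done" is detected with a lease timeout. The timeout is my assumption, not something prescribed above, and `WorkManagerSketch` with its methods is invented for illustration. A real version would sit behind some remote interface (RMI, sockets, a database table) and would need jobs to be safe to run twice.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

// Tracks jobs to do, jobs in progress, and reschedules jobs whose lease expired.
class WorkManagerSketch {
    private final LinkedList<Integer> pending = new LinkedList<Integer>();
    // job id -> time (in millis) at which the worker's lease runs out
    private final Map<Integer, Long> inProgress = new HashMap<Integer, Long>();
    private final long leaseMillis;
    private int done = 0;

    WorkManagerSketch(int jobCount, long leaseMillis) {
        this.leaseMillis = leaseMillis;
        for (int job = 0; job < jobCount; job++) {
            pending.add(Integer.valueOf(job));
        }
    }

    // A worker asks for something to do. Returns null if nothing is available right now.
    synchronized Integer claim(long now) {
        requeueExpired(now);
        if (pending.isEmpty()) {
            return null;
        }
        Integer job = pending.removeFirst();
        inProgress.put(job, Long.valueOf(now + leaseMillis));
        return job;
    }

    // A worker reports a result. Returns false if the job is not currently leased.
    synchronized boolean complete(int job) {
        if (inProgress.remove(Integer.valueOf(job)) == null) {
            return false;
        }
        done++;
        return true;
    }

    synchronized int doneCount() {
        return done;
    }

    // Jobs whose worker never reported back go to the end of the pending list.
    private void requeueExpired(long now) {
        Iterator<Map.Entry<Integer, Long>> it = inProgress.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> entry = it.next();
            if (entry.getValue().longValue() <= now) {
                Integer job = entry.getKey();
                it.remove();
                pending.addLast(job);
            }
        }
    }

    public static void main(String[] args) {
        WorkManagerSketch manager = new WorkManagerSketch(3, 100);
        Integer a = manager.claim(0);  // job 0
        manager.claim(0);              // job 1, whose worker then "crashes"
        manager.complete(a);
        Integer c = manager.claim(50); // job 2
        manager.complete(c);
        System.out.println("at t=60: " + manager.claim(60));   // prints "at t=60: null"
        System.out.println("at t=150: " + manager.claim(150)); // prints "at t=150: 1"
    }
}
```

The clock is passed in as a parameter only to keep the example deterministic; a real manager would read the time itself.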

You may want to elaborate on what kind of work you want to do.

Thorbjørn Ravn Andersen
+1  A: 

The problem you've described is definitely best solved using the master/worker pattern.

You should have a look at JavaSpaces (part of the Jini framework); it's really well suited to this kind of thing. Basically you just want to encapsulate each task to be carried out inside a Command object, subclassing as necessary. Dump these into the JavaSpace, let your workers grab and process one at a time, then reassemble when done.

Of course your performance gains will totally depend on how long it takes you to process each set of records, but JavaSpaces won't cause any problems if distributed across several machines.

alatkins
I'm not really sure, but no open-source implementation of JavaSpaces seems to be mature. In case you know of any, please let me know. Thanks a lot. JavaSpaces does seem to rock, though; I wish the implementations were more mature. At least Apache River!!
Franklin
Well there is JavaSpaces itself, which should still be available (as part of Jini - http://www.jini.org/wiki/Category:Getting_Started). GigaSpaces is definitely a mature product, and has a free version available. I've had no experience with it, however. If open source is a requirement then take a look at Blitz (http://www.dancres.org/blitz/).
alatkins
SemiSpace is OSS also http://www.semispace.org/semispace/
Taylor Gautier
JavaSpaces has been around since the early days of Java, and there's little I use nowadays that is *more* mature than the Sun Outrigger implementation (i.e. the standard JavaSpaces implementation).
Brian Agnew