As far as I understand, MPI gives me much more control over how exactly different nodes in the cluster will communicate.

In MapReduce/Hadoop, each node does some computation, exchanges data with other nodes, and then collates its partition of results. Seems simple, but since you can iterate the process, even algorithms like K-means or PageRank fit the model quite well. On a distributed file system with locality of scheduling, the performance is apparently good. In comparison, MPI gives me explicit control over how nodes send messages to each other.

Can anyone describe a cluster programming scenario where the more general MPI model is an obvious advantage over the simpler MapReduce model?

+2  A: 

Hi

Almost any scientific code -- finite differences, finite elements, etc. Which kind of leads to the circular answer that any distributed program which doesn't easily map to MapReduce would be better implemented with a more general MPI model. Not sure that's much help to you; I'll downvote this answer right after I post it.

Mark

High Performance Mark
Oh, I can't downvote my own answers -- could someone else do it for me?
High Performance Mark
Thanks, Mark (no need to downvote). Do you mean that iterative algorithms are more efficient in MPI, since in MapReduce they have to be implemented with a sequence of jobs? Apparently, MapReduce has acceptable performance at least for some iterative algorithms.
Igor ostrovsky
Not really. I was thinking of computations such as finite difference solvers, in which individual processes (on individual processors) compute over part of the total domain, then exchange halo information, then carry on computing. I find it difficult to see how this would map to MapReduce.
High Performance Mark
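A minimal sketch of that halo-exchange pattern in C with MPI (the local array size and step count here are just placeholders):

    #include <mpi.h>

    #define N 1000  /* local cells per rank (placeholder size) */

    int main(int argc, char **argv) {
        int rank, size;
        double u[N + 2] = {0};  /* interior cells plus one ghost cell at each end */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* neighbours in a 1-D decomposition; MPI_PROC_NULL at the boundaries */
        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int step = 0; step < 100; step++) {
            /* exchange halo (ghost) cells with both neighbours */
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                         &u[N + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* ... compute over the local part of the domain, then carry on ... */
        }

        MPI_Finalize();
        return 0;
    }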
In MapReduce, it is implemented by multiple jobs. Each MapReduce job is of the form: compute results, then exchange them. Multiple jobs can implement multiple "exchanges". With locality of scheduling, the next iteration of jobs is scheduled so that each task reads the data that was written to the local node by a task in the previous job, so the cost of multiple rounds of jobs is reduced.
Igor ostrovsky
Hmmm, I'll have to look a bit more closely at MapReduce. However, one source of performance reduction with MapReduce may be the strict sequencing of computation with communication; with MPI we try very hard (usually without much success) to overlap them.
High Performance Mark
Iterative algorithms are fine with a MapReduce framework: for example, "run this job on the previous job's results until a condition is met or we decide to give up". There are job-control schemes for Hadoop which abstract this away behind a query language. What the map-reduce paradigm doesn't do is communicate between nodes - no "start reducing when enough results are found among all mappers". So yes, no overlapping, and no skipping an unneeded mapper because another found what was needed.
Karl Anderson
Overlapping communication with computation is mostly a myth. Expensive networks can do it (they use DMA), but normally the CPU is involved with packing buffers. We don't yet have nonblocking collectives (though this might go into MPI-3), which is the use case where a lot of computation could be meaningfully overlapped. MPI is a much more general and higher-performance model; MapReduce offers a convenient abstraction with better fault tolerance for use cases where the "parallel" part of the algorithm is almost trivial.
Jed
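The overlap Jed describes is usually attempted with nonblocking point-to-point calls, roughly like this (a sketch; whether any transfer really proceeds in the background depends on the interconnect and the MPI implementation):

    #include <mpi.h>

    /* Post nonblocking halo transfers, compute on the interior cells
       (which need no halo data), then wait before touching the edges. */
    void step(double *u, int n, int left, int right) {
        MPI_Request reqs[4];

        MPI_Irecv(&u[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&u[n + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&u[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&u[n],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

        /* ... update interior cells 2..n-1 here, overlapping the transfers ... */

        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        /* ... now update edge cells 1 and n using the received halo values ... */
    }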
+1  A: 

The best answer that I could come up with is that MPI is better than MapReduce in two cases:

  1. For short tasks rather than batch processing. For example, MapReduce cannot be used to respond to individual queries - each job is expected to take minutes. I think that in MPI, you can build a query-response system where machines send messages to each other to route the query and generate the answer (see the sketch after this list).

  2. For jobs where nodes need to communicate more than iterated MapReduce jobs support, but not so much that the communication overhead makes the computation impractical. I am not sure how often such cases occur in practice, though.
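
A rough sketch of case 1 in C with MPI (the tags, the fixed-size text buffer, and the single worker are invented for illustration): a long-lived set of processes in which rank 0 routes a query to a worker and waits for the answer, with no per-query job-startup cost.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define QUERY_TAG  1
    #define ANSWER_TAG 2

    /* run with at least two ranks, e.g. mpirun -np 2 */
    int main(int argc, char **argv) {
        int rank;
        char buf[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* front end: forward one query to worker rank 1, wait for the answer */
            strcpy(buf, "some query");
            MPI_Send(buf, strlen(buf) + 1, MPI_CHAR, 1, QUERY_TAG, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, ANSWER_TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("answer: %s\n", buf);
        } else if (rank == 1) {
            /* worker: in a real system this would loop forever, serving queries */
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, QUERY_TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            strcpy(buf, "some answer");
            MPI_Send(buf, strlen(buf) + 1, MPI_CHAR, 0, ANSWER_TAG, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }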

Igor ostrovsky
MapReduce tasks can take milliseconds too; there is no requirement that they must take minutes
Zubair
+2  A: 

Although this question has been answered, I would like to add/reiterate one very important point.

MPI is best suited for problems that require a lot of interprocess communication.

When the data becomes large (petabytes, anyone?) and there is little interprocess communication, MPI becomes a pain. This is because the processes will spend all their time sending data to each other (bandwidth becomes the limiting factor) and your CPUs will remain idle. Perhaps an even bigger problem is reading all that data.

This is the fundamental reason for having something like Hadoop. The data also has to be distributed - hence the Hadoop Distributed File System!

In short, MPI is good for task parallelism and Hadoop is good for data parallelism.

Gitmo
