Is it correct to say that parallel computation with iterative MapReduce is mainly justified when the training data is too large to process with a non-parallel implementation of the same logic?

I am aware that there is overhead for starting MapReduce jobs, which can be critical for the overall execution time when a large number of iterations is required.

I can imagine that, in many cases, sequential computation is faster than parallel computation with iterative MapReduce as long as the data set fits in memory (see the rough cost sketch below).
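To make the concern concrete, here is a back-of-envelope cost model. All numbers (job startup time, per-iteration work, cluster size) are made-up assumptions purely for illustration, not measurements of any real cluster:

```python
# Rough cost model: per-job startup overhead multiplied by the number of
# iterations can dominate when each iteration's work is small enough to
# run in memory on one machine. All constants below are assumed values.

ITERATIONS = 100        # e.g. an iterative training algorithm
JOB_STARTUP = 30.0      # seconds to launch one MapReduce job (assumed)
WORK_PER_ITER = 10.0    # seconds of pure computation per iteration (assumed)
NODES = 20              # cluster size (assumed)

sequential = ITERATIONS * WORK_PER_ITER
mapreduce = ITERATIONS * (JOB_STARTUP + WORK_PER_ITER / NODES)

print(f"sequential in-memory : {sequential / 60:.1f} min")   # ~16.7 min
print(f"iterative MapReduce  : {mapreduce / 60:.1f} min")    # ~50.8 min
```

Under these assumptions the cluster loses badly, because the fixed startup cost is paid once per iteration while the parallelizable work per iteration is small.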

+1  A: 

Most of the time, no parallel processing system makes much sense if a single machine can do the job. The complexity associated with most parallelization tasks is significant, and you need a good reason to take it on.

Even when it's obvious that a task can't be completed in acceptable time without parallel processing, parallel execution frameworks come in different flavours: from lower-level, science-oriented tools like PVM or MPI to high-level, specialized (e.g. map/reduce) frameworks like Hadoop.

Among the parameters you should consider are startup time and scalability (how close to linearly the system scales). Hadoop will not be a good choice if you need answers quickly, but it might be a good one if you can fit your process into a map/reduce frame.
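As a minimal sketch of what "fitting into a map/reduce frame" means: the computation is expressed as independent per-record map calls plus an order-independent reduce per key. This is plain Python showing only the shape of the decomposition, not the Hadoop API; the function names and data are illustrative assumptions:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    # emit (key, value) pairs independently for each input record
    for word in record.split():
        yield (word, 1)

def reduce_phase(key, values):
    # combine all values for one key; must not depend on ordering
    return (key, sum(values))

records = ["the cat sat", "the dog sat"]
pairs = sorted(kv for r in records for kv in map_phase(r))
result = [reduce_phase(k, (v for _, v in group))
          for k, group in groupby(pairs, key=itemgetter(0))]
print(result)  # [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]
```

If each iteration of your algorithm can be written in this map-then-reduce shape, Hadoop can scale it across a cluster; if not, or if the per-iteration work is tiny, the framework's overhead is hard to justify.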

Tomislav Nakic-Alfirevic