views:

206

answers:

4

I'm getting into parallel programming and I'm studying mapreduce and other distributed algorithms. Is it best just to learn mapreduce or is there a more general algorithm that will serve me better?

+4  A: 

It depends what you intend to use the algorithm(s) for.

MapReduce is a generalised and very useful programming model. (Google bases many of it's internal indexing processes on it). Learning it certainly won't do you any harm.

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

The most important parallel processing concept to learn is quite simple: synchronisation is what you need to minimise if you want to attain effective speedup.

Strive for:

  • Large granularity of work chunks
  • Keep size work chunks similiar in size
  • Minimise the number of synchronisation steps
Mitch Wheat
I want to use the algorithm for many things. Its just that I'm not a developer so it takes me quite a long time to learn mareduce, and I want to avoid the situation where I learn map reduce well, and then I find out that there is some more general concept I should have learnt. I guess I need baby steps :)
Zubair
+2  A: 

If you want to learn something about parallel processing, I do not believe that picking a single algorithm will provide you with significant insights.

Mapreduce is a composition of a map and a reduce operation. These are typical higher-order functions that functional languages provide.

I would recommend first to learn a functional language, for example Scheme or Clojure. For Scheme, "Structure and Interpretation of Computer Programs" seems to be all the rage.

Svante
@svante. Thanks for the advice. I guess this parallel programming is alot more complicated than I first thought. I'll check out the Scheme you mentioned
Zubair
The power of mapreduce lies actually in what is *between* the map and reduce phase, namely, grouping results of `map` by their key before feeding each resulting group to a reducer.
jkff
Now, that makes sense. The grouping of results by their key is done on each local node I presume?
Zubair
No, it is done with a distributed sorting algorithm. It is the performance of this algorithm that largely determines performance of the mapreduce implementation.
jkff
+2  A: 

For many "regular" serial algorithms, there are parallel versions, some of which can be modeled with MapReduce. Certainly learn MapReduce, as it is new and exciting, but it's just another tool in your toolbox, and you certainly can learn more, as there are limitations to MapReduce (and you'll learn about them).

Larry
+1  A: 

To really get a good appreciation of parallel programming, you should study several models of parallel programming and not just one parallel programming framework. You should study both shared memory (e.g. pthreads) and message passing (e.g. MPI and MapReduce) approaches to parallel programming.

MPI is a very general purpose tool for creating message-passing applications. If you use MPI extensively, you will find that some elements of MPI programs recur over and over again, such as setting up a "master" process that partitions work to "worker" processes, and aggregates the results. MapReduce is a particular implementation of a message-passing framework and provides a simpler programming model than MPI. It takes care of code that occurs quite frequently in parallel applications and, more importantly, takes care of such issues as failure recovery and data locality. The opensource Hadoop attempts to mimic MapReduce.

I think you will be better able to appreciate what MapReduce does and how it might be implemented by writing several MPI programs of your own. It can't hurt to learn Hadoop, but when it comes to general knowledge of parallel programming, it is good to be familiar with the basics like pthreads, OpenMP, and MPI.

Michael Aaron Safyan