I'm getting into parallel programming and I'm studying mapreduce and other distributed algorithms. Is it best just to learn mapreduce or is there a more general algorithm that will serve me better?
It depends on what you intend to use the algorithm(s) for.
MapReduce is a generalised and very useful programming model. (Google bases many of its internal indexing processes on it.) Learning it certainly won't do you any harm.
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the original paper (Dean & Ghemawat, OSDI 2004).
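To make that description concrete, here is a toy, single-process Python sketch of the model using the paper's canonical word-count example. Real implementations distribute the map, shuffle, and reduce phases across many machines; this only illustrates the shape of the computation.

```python
from collections import defaultdict

def map_fn(doc):
    # "map": emit an intermediate (word, 1) pair for every word.
    for word in doc.split():
        yield (word, 1)

def reduce_fn(key, values):
    # "reduce": merge all intermediate values for one key.
    return (key, sum(values))

def mapreduce(docs):
    groups = defaultdict(list)
    for doc in docs:                      # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)     # shuffle: group by key
    return [reduce_fn(k, vs) for k, vs in groups.items()]  # reduce phase

print(mapreduce(["the cat sat", "the dog sat"]))
# [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]
```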
The most important parallel processing concept to learn is quite simple: synchronisation is what you need to minimise if you want to attain effective speedup.
Strive for:
- Large granularity of work chunks
- Keep work chunks similar in size
- Minimise the number of synchronisation steps (a short sketch of all three points follows this list)
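Here is a minimal Python sketch of those three points using multiprocessing.Pool; the `expensive` function and the input sizes are made up for illustration. Each worker receives large, equally sized blocks of work, and the only synchronisation is the single implicit join when `pool.map` returns.

```python
from multiprocessing import Pool

def expensive(n):
    # Stand-in for a CPU-heavy computation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [200_000] * 16            # 16 equally sized work items
    with Pool(processes=4) as pool:
        # chunksize=4 hands each worker a big block of items at once
        # (large granularity) instead of one item at a time, which
        # cuts down on coordination overhead between processes.
        results = pool.map(expensive, inputs, chunksize=4)
    print(sum(results))                # single synchronisation point
```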
If you want to learn something about parallel processing, I do not believe that picking a single algorithm will provide you with significant insights.
MapReduce is a composition of a map and a reduce operation. These are typical higher-order functions that functional languages provide.
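For instance, here is that composition in Python (a functional language like Scheme or Clojure would express the same thing even more naturally):

```python
from functools import reduce

words = ["parallel", "programming", "is", "fun"]
lengths = map(len, words)                           # map: transform each element independently
total = reduce(lambda acc, n: acc + n, lengths, 0)  # reduce: fold the results into one value
print(total)  # 24
```

Because each map application is independent, the map step can run in parallel; the reduce step is where results get combined.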
I would recommend first to learn a functional language, for example Scheme or Clojure. For Scheme, "Structure and Interpretation of Computer Programs" seems to be all the rage.
For many "regular" serial algorithms, there are parallel versions, some of which can be modeled with MapReduce. Certainly learn MapReduce, as it is new and exciting, but it's just another tool in your toolbox, and you certainly can learn more, as there are limitations to MapReduce (and you'll learn about them).
To really get a good appreciation of parallel programming, you should study several models of parallel programming and not just one parallel programming framework. You should study both shared memory (e.g. pthreads) and message passing (e.g. MPI and MapReduce) approaches to parallel programming.
MPI is a very general-purpose tool for creating message-passing applications. If you use MPI extensively, you will find that some elements of MPI programs recur over and over again, such as setting up a "master" process that partitions work among "worker" processes and aggregates the results. MapReduce is a particular implementation of a message-passing framework that provides a simpler programming model than MPI. It takes care of code that occurs quite frequently in parallel applications and, more importantly, handles issues such as failure recovery and data locality. The open-source Hadoop project mimics Google's MapReduce.
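To make that master/worker pattern concrete, here is a minimal sketch using mpi4py (assuming mpi4py and an MPI runtime are installed; the work itself, summing squares, is made up for illustration). Rank 0 partitions the input and aggregates the results; every other rank processes its share.

```python
from mpi4py import MPI

# Run with e.g.: mpirun -n 4 python master_worker.py
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Master: partition the work into one chunk per worker.
    work = list(range(100))
    chunks = [work[i::size - 1] for i in range(size - 1)]
    for w in range(1, size):
        comm.send(chunks[w - 1], dest=w, tag=1)
    # Aggregate the partial results from all workers.
    total = sum(comm.recv(source=w, tag=2) for w in range(1, size))
    print("aggregated result:", total)
else:
    # Worker: receive a chunk, do the work, send the result back.
    chunk = comm.recv(source=0, tag=1)
    comm.send(sum(x * x for x in chunk), dest=0, tag=2)
```

Writing even a small program like this by hand makes it clear how much boilerplate (partitioning, result collection, and especially failure handling) MapReduce takes off your plate.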
I think you will be better able to appreciate what MapReduce does and how it might be implemented by writing several MPI programs of your own. It can't hurt to learn Hadoop, but when it comes to general knowledge of parallel programming, it is good to be familiar with the basics like pthreads, OpenMP, and MPI.