views:

207

answers:

2

This question does not have a single "right" answer.

I'm interested in running Map Reduce algorithms, on a cluster, on Terabytes of data.

I want to learn more about the running time of said algorithms.

What books should I read?

I'm not interested in setting up Map Reduce clusters, or running standard algorithms. I want rigorous theoretical treatments or running time.

EDIT: The issue is not that map reduce changes running time. The issue is -- most algorithms do not distribute well to map reduce frameworks. I'm interested in algorithms that run on the map reduce framework.

+3  A: 

Technically, there's no real different in the runtime analysis of MapReduce in comparison to "standard" algorithms - MapReduce is still an algorithm just like any other (or specifically, a class of algorithms that occur in multiple steps, with a certain interaction between those steps).

The runtime of a MapReduce job is still going to scale how normal algorithmic analysis would predict, when you factor in division of tasks across multiple machines and then find the maximum individual machine time required for each step.

That is, if you have a task which requires M map operations, and R reduce operations, running on N machines, and you expect that the average map operation will take m time and the average reduce operation r time, then you'll have an expected runtime of ceil(M/N)*m + ceil(R/N)*r time to complete all of the tasks in question.

Prediction of the values for M,R,m, and r are all something that can be accomplished with normal analysis of whatever algorithm you're plugging into MapReduce.

Amber
A: 

There are only two books that i know of that are published, but there are more in the works:

Pro hadoop and Hadoop: The Definitive Guide

Of these, Pro Hadoop is more of a beginners book, whilst The Definitive Guide is for those that know what Hadoop actually is.

I own The Definitive Guide and think its an excellent book. It provides good technical details on how the HDFS works, as well as covering a range of related topics such as MapReduce, Pig, Hive, HBase etc. It should also be noted that this book was written by Tom White who has been involved with the development of Hadoop for a good while, and now works at cloudera.

As far as the analysis of algorithms goes on Hadoop you could take a look at the TeraByte sort benchmarks. Yahoo have done a write up of how Hadoop performs for this particular benchmark: TeraByte Sort on Apache Hadoop. This paper was written in 2008.

More details about the 2009 results can be found here.

Binary Nerd