views: 1907
answers: 8

Related to my CouchDB question....

Can anyone explain MapReduce in terms a numbnuts could understand?

+1  A: 

What's wrong with the Wikipedia page?

OysterD
The Wikipedia page seems to indicate that Google invented MapReduce, which is wrong.
Zubair
+11  A: 
  1. Take a bunch of data
  2. Perform some kind of transformation that converts every datum to another kind of datum
  3. Combine those new data into yet simpler data

Step 2 is Map. Step 3 is Reduce.

For example,

  1. Get time between two impulses on a pair of pressure meters on the road
  2. Map those times into speeds based upon the distance of the meters
  3. Reduce those speeds to an average speed

MapReduce is split between Map and Reduce because the different parts can easily be done in parallel. (Especially if Reduce has certain mathematical properties.)

For a complex but good description of MapReduce, see: Google's MapReduce Programming Model -- Revisited (PDF).
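
Here's that road-sensor example as a minimal Python sketch; the distance and the timings are made up for illustration:

from functools import reduce

distance_m = 50.0               # assumed distance between the two pressure meters
times_s = [2.5, 2.0, 3.1, 1.8]  # made-up times between impulses, one per vehicle

# Map: transform each time into a speed
speeds = list(map(lambda t: distance_m / t, times_s))

# Reduce: combine those speeds into a single average
average = reduce(lambda a, b: a + b, speeds) / len(speeds)
print(average)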

Frank Krueger
I would say for step 3, "combine" instead of "transform"
TraumaPony
+1 for a nice simple example.
Frank Shearar
+4  A: 

Link to the paper where it is described: Map-Reduce paper

Iker Jimenez
+4  A: 

Let's take the example from the Google paper. The goal of MapReduce is to make efficient use of a load of processing units working in parallel, for certain kinds of algorithms. The example is the following: you want to extract all the words and their counts from a set of documents.

Typical implementation:

for each document
    for each word in the document
        get the counter associated with the word for the document
        increment that counter
    end for
end for
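
As a runnable Python sketch of that serial version (the documents dict is made up for illustration):

from collections import defaultdict

documents = {"d1": "the quick fox", "d2": "the lazy dog"}  # made-up input

counts = defaultdict(int)
for document in documents.values():
    for word in document.split():
        counts[word] += 1  # the counter associated with the word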

MapReduce implementation:

Map phase (input: document key, document)
for each word in the document
    emit an event with the word as the key and the value "1"
end for

Reduce phase (input: key (a word), an iterator going through the emitted values)
for each value in the iterator
    sum up the value in a counter
end for

Around that, you'll have a master program which partitions the set of documents into "splits" that are handled in parallel during the Map phase. The emitted values are written by each worker into a buffer specific to that worker. The master program then delegates other workers to perform the Reduce phase as soon as it is notified that a buffer is ready to be handled.

Every worker's output (whether from a Map or a Reduce worker) is in fact a file stored on the distributed file system (GFS for Google), or in the distributed database in the case of CouchDB.
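
Here is a single-process Python sketch of those two phases; there is no real distribution, and the grouping step stands in for the master's shuffle between workers:

from collections import defaultdict

def map_phase(doc_id, document):
    # Emit (word, 1) for every word in the document
    for word in document.split():
        yield (word, 1)

def reduce_phase(word, values):
    # Sum up the emitted values for this word
    return word, sum(values)

documents = {"d1": "the quick fox", "d2": "the lazy dog"}  # made-up input

# Shuffle: group the emitted values by key, as the master arranges between phases
grouped = defaultdict(list)
for doc_id, doc in documents.items():
    for key, value in map_phase(doc_id, doc):
        grouped[key].append(value)

counts = dict(reduce_phase(w, vs) for w, vs in grouped.items())
print(counts)  # {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}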

Damien B
+6  A: 

MapReduce is a method to process vast sums of data in parallel without requiring the developer to write any code other than the mapper and reducer functions.

The map function takes data in and churns out a result, which is held in a barrier. Many instances of the same map task can run in parallel. The dataset can then be reduced to a scalar value.

So if you think of it like a SQL statement

SELECT SUM(salary)
FROM employees
WHERE salary > 1000
GROUP BY deptname

We can use map to get our subset of employees with salary > 1000, which map emits to the barrier, bucketed into groups.

Reduce will sum each of those groups, giving you your result set.

(Just plucked this from my university study notes on the Google paper.)
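
Roughly how that statement decomposes into the two phases, sketched in Python with made-up rows:

from collections import defaultdict

employees = [  # hypothetical (deptname, salary) rows
    ("sales", 1200), ("sales", 900), ("eng", 2000), ("eng", 1500),
]

# Map: apply the WHERE filter and emit each row into its GROUP BY bucket
buckets = defaultdict(list)
for dept, salary in employees:
    if salary > 1000:
        buckets[dept].append(salary)

# Reduce: SUM(salary) within each bucket
result = {dept: sum(salaries) for dept, salaries in buckets.items()}
print(result)  # {'sales': 1200, 'eng': 3500}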

John Nolan
+21  A: 

Going all the way down to the basics for Map and Reduce.


Map is a function which "transforms" items in some kind of list into another kind of item, putting them back into the same kind of list.

Suppose I have a list of numbers, [1,2,3], and I want to double every number. In this case, the function to "double every number" is x => x * 2. Without map, I could write a simple loop, say

A = [1, 2, 3]
for (i = 0; i < A.length; i++) A[i] = A[i] * 2

and I'd have A = [2, 4, 6]. But instead of writing loops, if I have a map function I could write

A = [1, 2, 3].Map(x => x * 2)

The x => x * 2 is a function to be executed against the elements in [1,2,3]. What happens is that the program takes each item, executes (x => x * 2) against it by making x equal to that item, and produces a list of the results.

1 : 1 => 1 * 2 : 2  
2 : 2 => 2 * 2 : 4  
3 : 3 => 3 * 2 : 6

So after executing the map function with (x => x * 2), you'd have [2, 4, 6].
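
In a real language (Python, for concreteness) that call is a one-liner:

doubled = list(map(lambda x: x * 2, [1, 2, 3]))
print(doubled)  # [2, 4, 6]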


Reduce is a function which "collects" the items in lists and performs some computation on all of them, thus reducing them to a single value.

Finding a sum or finding averages are both instances of a reduce function. For example, if you have a list of numbers, say [7, 8, 9], and you want them summed up, you'd write a loop like this

A = [7, 8, 9]
sum = 0
foreach (item in A) sum = sum + item

But, if you have access to a reduce function, you could write it like this

A = [7, 8, 9]
sum = A.reduce( 0, (x, y) => x + y )

Now it's a little confusing why there are 2 arguments (0 and the function with x and y) passed. For a reduce function to be useful, it must be able to take 2 items, compute something, and "reduce" those 2 items to just one single value; that way the program can reduce each pair until we have a single value.

The execution would go as follows:

result = 0
7 : result = result + 7 = 0 + 7 = 7
8 : result = result + 8 = 7 + 8 = 15
9 : result = result + 9 = 15 + 9 = 24

But you don't want to start with zeroes all the time, so the first argument is there to let you specify a seed value, specifically the value in the first result = line.

Say you want to sum 2 lists; it might look like this:

A = [7, 8, 9]
B = [1, 2, 3]
sum = 0
sum = A.reduce( sum, (x, y) => x + y )
sum = B.reduce( sum, (x, y) => x + y )
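
The same thing runs in Python via functools.reduce; note that its signature puts the seed last, reduce(function, iterable, seed), the reverse of the pseudocode above:

from functools import reduce

A = [7, 8, 9]
B = [1, 2, 3]

total = reduce(lambda x, y: x + y, A, 0)      # 24, seeded with 0
total = reduce(lambda x, y: x + y, B, total)  # 30, seeded with A's total
print(total)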


It's a good thing in DB software because, with Map/Reduce support, you can work with the database without needing to know how the data is stored in the DB; that's what a DB engine is for.

You just need to be able to "tell" the engine what you want by supplying it with either a Map or a Reduce function, and then the DB engine can find its way around the data, apply your function, and come up with the results you want, all without you knowing how it loops over all the records.

There are indexes and keys and joins and views and a lot of other stuff a single database could hold, so by shielding you from how the data is actually stored, your code is made easier to write and maintain.

The same goes for parallel programming: if you only specify what you want done with the data instead of actually implementing the looping code, the underlying infrastructure can "parallelize" and execute your function in a simultaneous parallel loop for you.
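
Python's multiprocessing pool is one concrete case of that hand-off: you supply only the function, and the pool owns the loop and the parallelism:

from multiprocessing import Pool

def double(x):
    return x * 2

if __name__ == "__main__":
    with Pool(4) as pool:                    # 4 worker processes
        print(pool.map(double, [1, 2, 3]))   # the pool parallelizes the loop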

chakrit
Ok, I understand map and reduce taken individually. But what applications could I have for reduce? In a Google scenario, would they use it, for example, for summing a series of parameters that give them the ranking of a page for a given keyword?
L. De Leo
@lbolognini var total = orders.Sum(o => o.UnitPrice * o.Quantity)
chakrit
@lbolognini There are many uses when you abstract away the very concept of looping. In Google's scenario they probably have 1000s of machines for calculating pageranks, links and whatnot. What do they do when they need to add a few more servers? Modifying every single piece of looping code is probably not an option. So what they did was write their calculation code against a "Reduce" function instead... And when the list of servers changes, only the "Reduce" function needs to be changed. Got it?
chakrit
How would reduce compute the average? From what I see, I'm guessing you couldn't? Maybe map the numerator and denominator and divide at the end of summing both?
arcticpenguin
@arcticpenguin I'm being a little too generic there. Actually `Average()` is supposedly icing on top of `Sum()`. But I talked about it to illustrate why the function is called "Reduce"... An average function is something that takes a list of numbers and *reduces* it to a single number (which is the average).
chakrit
+15  A: 

Joel Spolsky has a good explanation for beginners - http://www.joelonsoftware.com/items/2006/08/01.html

Naseer
+3  A: 

MAP and REDUCE are old Lisp functions.

Imagine you have a list of cities with information about the name, the number of people living there, and the size of the city:

(defparameter *cities*
  '((a :people 100000 :size 200)
    (b :people 200000 :size 300)
    (c :people 150000 :size 210)))

Now you may want to find the city with the highest population density.

First we create a list of city names and population density using MAP:

(map 'list
     (lambda (city)
         (list (first city)
               (/ (getf (rest city) :people)
                  (getf (rest city) :size))))
     *cities*)

=>   ((A 500) (B 2000/3) (C 5000/7))

Using REDUCE we can now find the city with the largest population density.

(reduce (lambda (a b)
          (if (> (second a) (second b))
             a
             b))
        '((A 500) (B 2000/3) (C 5000/7)))

 =>   (C 5000/7)

Combining both parts we get the following code:

(reduce (lambda (a b)
          (if (> (second a) (second b))
             a
             b))
        (map 'list
             (lambda (city)
                (list (first city)
                   (/ (getf (rest city) :people)
                      (getf (rest city) :size))))
             *cities*))

Let's introduce functions:

(defun density (city)
   (list (first city)
         (/ (getf (rest city) :people)
            (getf (rest city) :size))))

(defun max-density (a b)
   (if (> (second a) (second b))
          a
          b))

Then we can write our MAP REDUCE code as:

(reduce 'max-density
        (map 'list 'density *cities*))

 =>   (C 5000/7)
Rainer Joswig