Related to my CouchDB question....
Can anyone explain MapReduce in terms a numbnuts could understand?
At its core, MapReduce boils down to three steps:

1. Take a bunch of data.
2. Perform some kind of transformation that converts every datum to another kind of datum.
3. Combine the new data into yet simpler data.

Step 2 is Map. Step 3 is Reduce.

For example:

1. Get the times between impulse pairs from sensors on a road.
2. Map those times into speeds based on the distance between the sensors.
3. Reduce those speeds to an average speed.

The reason MapReduce is split between Map and Reduce is that the different parts can easily be done in parallel. (Especially if Reduce has certain mathematical properties: when it is associative and commutative, partial results can be combined in any order and still give the same answer.)
For a complex but good description of MapReduce, see: Google's MapReduce Programming Model -- Revisited (PDF).
Let's take the example from the Google paper. The goal of MapReduce is to make efficient use of a large number of processing units working in parallel for certain kinds of algorithms. The example is the following: you want to extract all the words, and their counts, from a set of documents.
Typical implementation:
for each document
    for each word in the document
        get the counter associated with the word for the document
        increment that counter
    end for
end for
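As a concrete illustration (a minimal Python sketch of the same loop, not part of the original answer):

from collections import defaultdict

def word_count_sequential(documents):
    # Count word occurrences across all documents with plain nested loops.
    counts = defaultdict(int)
    for document in documents:
        for word in document.split():
            counts[word] += 1
    return dict(counts)

print(word_count_sequential(["the cat sat", "the dog sat"]))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}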
MapReduce implementation:
Map phase (input: document key, document)
    for each word in the document
        emit an event with the word as the key and the value "1"
    end for

Reduce phase (input: key (a word), an iterator going through the emitted values)
    for each value in the iterator
        sum up the value in a counter
    end for
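Translated into a runnable sketch (Python here, purely for illustration; the driver loop below is a stand-in for the master program described next, not part of the original answer):

from collections import defaultdict

def map_phase(doc_key, document):
    # Emit (word, 1) for every word in the document.
    for word in document.split():
        yield (word, 1)

def reduce_phase(word, values):
    # Sum up all the 1s emitted for this word.
    return (word, sum(values))

documents = {"doc1": "the cat sat", "doc2": "the dog sat"}
groups = defaultdict(list)
for key, doc in documents.items():        # in real MapReduce these map calls run in parallel
    for word, one in map_phase(key, doc):
        groups[word].append(one)          # the "shuffle": group emitted values by key
print([reduce_phase(word, values) for word, values in groups.items()])
# [('the', 2), ('cat', 1), ('sat', 2), ('dog', 1)]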
Around that, you'll have a master program which partitions the set of documents into "splits" that are handled in parallel during the Map phase. The emitted values are written by each worker into a buffer specific to that worker. The master program then delegates other workers to perform the Reduce phase as soon as it is notified that a buffer is ready to be handled.
Every worker's output (whether from a Map or a Reduce worker) is in fact a file stored on the distributed file system (GFS for Google), or in the distributed database in the case of CouchDB.
MapReduce is a method to process vast sums of data in parallel without requiring the developer to write any code other than the mapper and reducer functions.
The map function takes data in and churns out a result, which is held in a barrier. This function can run in parallel with a large number of instances of the same map task. The dataset can then be reduced to a scalar value.
So if you think of it like a SQL statement
SELECT SUM(salary)
FROM employees
WHERE salary > 1000
GROUP BY deptname
We can use map to get our subset of employees with salary > 1000; map emits those rows to the barrier, partitioned into per-department buckets. Reduce will then sum each of those buckets, giving you your result set.
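A rough Python rendering of that query as a mapper and a reducer (my own sketch with made-up sample data, just to make the analogy concrete):

from collections import defaultdict

employees = [
    {"deptname": "sales", "salary": 1500},
    {"deptname": "sales", "salary": 1200},
    {"deptname": "eng",   "salary": 900},   # filtered out by the WHERE clause
    {"deptname": "eng",   "salary": 2000},
]

def mapper(row):
    # WHERE salary > 1000: emit (deptname, salary), or nothing.
    if row["salary"] > 1000:
        yield (row["deptname"], row["salary"])

def reducer(dept, salaries):
    # SUM(salary) for one GROUP BY deptname bucket.
    return (dept, sum(salaries))

buckets = defaultdict(list)
for row in employees:
    for dept, salary in mapper(row):
        buckets[dept].append(salary)
print([reducer(dept, salaries) for dept, salaries in buckets.items()])
# [('sales', 2700), ('eng', 2000)]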
(Just plucked this from my university study notes on the Google paper.)
Going all the way down to the basics of Map and Reduce.
Map is a function which "transforms" items in some kind of list into another kind of item and puts them back into a list of the same kind.
Suppose I have a list of numbers [1, 2, 3] and I want to double every number. In this case, the function to "double every number" is x => x * 2. Without map, I could write a simple loop, say
A = [1, 2, 3]
for (var i = 0; i < A.Length; i++) A[i] = A[i] * 2
and I'd have A = [2, 4, 6]. But instead of writing loops, if I have a map function I could write
A = [1, 2, 3].Map(x => x * 2)
The x => x * 2 part is a function to be executed against the elements in [1, 2, 3]. What happens is that the program takes each item, executes (x => x * 2) against it by binding x to each item in turn, and produces a list of the results.
1 : 1 => 1 * 2 : 2
2 : 2 => 2 * 2 : 4
3 : 3 => 3 * 2 : 6
so after executing the map function with (x => x * 2) you'd have [2, 4, 6].
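In a real language this is a one-liner; for instance, a Python equivalent (my illustration, not from the original answer) would be:

A = [1, 2, 3]
doubled = list(map(lambda x: x * 2, A))  # apply x * 2 to every element
print(doubled)  # [2, 4, 6]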
Reduce is a function which "collects" the items in a list and performs some computation on all of them, thus reducing them to a single value.
Finding a sum or an average are both instances of a reduce function. For example, if you have a list of numbers, say [7, 8, 9], and you want them summed up, you'd write a loop like this
A = [7, 8, 9]
sum = 0
foreach (item in A) sum = sum + item
But, if you have access to a reduce function, you could write it like this
A = [7, 8, 9]
sum = A.reduce( 0, (x, y) => x + y )
Now it may be a little confusing why two arguments (0 and the function with x and y) are passed. For a reduce function to be useful, it must be able to take two items, compute something, and "reduce" those two items to a single value; that way the program can keep reducing pairs until only one value remains.
The execution would go as follows:
result = 0
7 : result = result + 7 = 0 + 7 = 7
8 : result = result + 8 = 7 + 8 = 15
9 : result = result + 9 = 15 + 9 = 24
But you don't always want to start with zero, so the first argument is there to let you specify a seed value, specifically the value in the first result = line.
Say you want to sum two lists; it might look like this:
A = [7, 8, 9]
B = [1, 2, 3]
sum = 0
sum = A.reduce( sum, (x, y) => x + y )
sum = B.reduce( sum, (x, y) => x + y )
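In Python the same thing can be written with functools.reduce, whose optional third argument is the seed (an illustrative sketch):

from functools import reduce

A = [7, 8, 9]
B = [1, 2, 3]

total = reduce(lambda x, y: x + y, A, 0)      # seed 0, fold in A -> 24
total = reduce(lambda x, y: x + y, B, total)  # reuse the running total as the seed -> 30
print(total)  # 30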
It's a good thing in database software because, with Map/Reduce support, you can work with the database without needing to know how the data is stored in it; that's what a DB engine is for.
You just need to be able to "tell" the engine what you want by supplying it with either a Map or a Reduce function; the DB engine can then find its way around the data, apply your function, and come up with the results you want, all without you knowing how it loops over all the records.
There are indexes and keys and joins and views and a lot of other stuff a single database can hold, so by shielding you from how the data is actually stored, your code is made easier to write and maintain.
The same goes for parallel programming: if you only specify what you want done with the data instead of implementing the looping code yourself, the underlying infrastructure can "parallelize" your function and execute it in a simultaneous parallel loop for you.
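As a minimal sketch of that idea, assuming nothing beyond Python's standard library (a real MapReduce framework adds distribution, shuffling, and fault tolerance on top):

from functools import reduce
from multiprocessing import Pool

def double(x):
    # The mapper must be a top-level function so worker processes can import it.
    return x * 2

if __name__ == "__main__":
    with Pool() as pool:
        doubled = pool.map(double, [1, 2, 3, 4])    # the map step, spread across processes
    total = reduce(lambda x, y: x + y, doubled, 0)  # the reduce step, combining the results
    print(total)  # 20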
Joel Spolsky has a good explanation for beginners - http://www.joelonsoftware.com/items/2006/08/01.html
MAP and REDUCE are old Lisp functions.
Imagine you have a list of cities with information about the name, the number of people living there, and the size of the city:
(defparameter *cities*
  '((a :people 100000 :size 200)
    (b :people 200000 :size 300)
    (c :people 150000 :size 210)))
Now you may want to find the city with the highest population density.
First we create a list of city names and population densities using MAP:
(map 'list
     (lambda (city)
       (list (first city)
             (/ (getf (rest city) :people)
                (getf (rest city) :size))))
     *cities*)

=> ((A 500) (B 2000/3) (C 5000/7))
Using REDUCE we can now find the city with the largest population density.
(reduce (lambda (a b)
          (if (> (second a) (second b))
              a
              b))
        '((A 500) (B 2000/3) (C 5000/7)))

=> (C 5000/7)
Combining both parts we get the following code:
(reduce (lambda (a b)
          (if (> (second a) (second b))
              a
              b))
        (map 'list
             (lambda (city)
               (list (first city)
                     (/ (getf (rest city) :people)
                        (getf (rest city) :size))))
             *cities*))
Let's introduce functions:
(defun density (city)
  (list (first city)
        (/ (getf (rest city) :people)
           (getf (rest city) :size))))

(defun max-density (a b)
  (if (> (second a) (second b))
      a
      b))
Then we can write our MAP REDUCE code as:
(reduce 'max-density
        (map 'list 'density *cities*))

=> (C 5000/7)
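For comparison, a direct Python transcription of the same pipeline (my own sketch; Fraction stands in for Lisp's exact rationals):

from fractions import Fraction
from functools import reduce

cities = [("a", {"people": 100000, "size": 200}),
          ("b", {"people": 200000, "size": 300}),
          ("c", {"people": 150000, "size": 210})]

def density(city):
    # Map step: (name, info) -> (name, people / size).
    name, info = city
    return (name, Fraction(info["people"], info["size"]))

def max_density(a, b):
    # Reduce step: keep whichever of the two (name, density) pairs is denser.
    return a if a[1] > b[1] else b

print(reduce(max_density, map(density, cities)))
# ('c', Fraction(5000, 7))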