views:

278

answers:

3

I read the MapReduce article at http://en.wikipedia.org/wiki/MapReduce and understood the example of how to get the count of a "word" in many "documents". However, I did not understand the following line:

Thus the MapReduce framework transforms a list of (key, value) pairs into a list of values. This behavior is different from the functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map.

Can someone elaborate on the difference (the MapReduce framework vs. the map and reduce combination)? In particular, what does reduce do in functional programming?

Thanks a great deal.

+1  A: 

Using the word count example, the original functional map() would take a set of documents, optionally distribute subsets of that set, and for each document emit a single value representing the number of words (or a particular word's occurrences) in that document. A functional reduce() would then add up those per-document values into a global count. So you get a single total (either of all words or of a particular word).
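
For concreteness, here's a minimal Python sketch of that plain functional version (the documents are just made-up strings):

from functools import reduce  # reduce lives in functools on Python 3

docs = ["the quick brown fox", "the lazy dog"]  # made-up example documents

# map: one value (a word count) per document
counts = map(lambda doc: len(doc.split()), docs)

# reduce: combine all the mapped values into one single value
total = reduce(lambda a, b: a + b, counts, 0)
print(total)  # 6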

In MapReduce, the map would emit a (word, count) pair for each word in each document. A MapReduce reduce() would then add up the counts for each word separately, without mixing different words into a single pile. So you get a list of words paired with their counts.
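
Again as a minimal Python sketch (same made-up documents; the explicit grouping stands in for the framework's shuffle phase):

from collections import defaultdict

docs = ["the quick brown fox", "the lazy dog"]  # made-up example documents

# map: emit a (word, 1) pair for every word in every document
pairs = [(word, 1) for doc in docs for word in doc.split()]

# shuffle: group the emitted pairs by key, i.e. by word
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# reduce: one reduction per key, so different words never mix
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}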

Max Shawabkeh
+2  A: 

The main difference would be that MapReduce is apparently patentable. (Couldn't help myself, sorry...)

On a more serious note: the MapReduce paper, as I remember it, describes a methodology for performing calculations in a massively parallelised fashion. This methodology builds upon the map / reduce construct, which was well known for years before, but goes beyond it into such matters as distributing the data. Also, some constraints are imposed on the structure of the data operated upon and returned by the functions used in the map-like and reduce-like parts of the computation (the thing about data coming in lists of key/value pairs), so you could say that MapReduce is a massive-parallelism-friendly specialisation of the map & reduce combination.

As for the Wikipedia comment on the function being mapped in functional programming's map / reduce construct producing one value per input... Well, sure it does, but there are no constraints at all on the type of said value. In particular, it could be a complex data structure, like perhaps a list of things to which you would again apply a map / reduce transformation. Going back to the "counting words" example, you could very well have a function which, for a given portion of text, produces a data structure mapping words to occurrence counts, map that over your documents (or chunks of documents, as the case may be) and reduce the results.
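
Here's a minimal Python sketch of that nested approach (made-up chunks; Counter addition plays the same role as the merge-with mentioned below):

from collections import Counter
from functools import reduce

chunks = ["the quick brown fox", "the lazy dog"]  # made-up text chunks

# map: each chunk yields a complex value, a word -> count mapping
per_chunk = map(lambda text: Counter(text.split()), chunks)

# reduce: merge the mappings; adding Counters sums counts per word
merged = reduce(lambda a, b: a + b, per_chunk)
print(merged)  # Counter({'the': 2, 'quick': 1, ...})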

In fact, that's exactly what happens in this article by Phil Hagelberg. It's a fun and supremely short example of a MapReduce-word-counting-like computation implemented in Clojure with map and something equivalent to reduce (the (apply + (merge-with ...)) bit -- merge-with is implemented in terms of reduce in clojure.core). The only difference from the Wikipedia example is that the objects being counted are URLs instead of arbitrary words -- other than that, you've got a word-counting algorithm implemented with map and reduce, MapReduce-style, right there. The reason it might not fully qualify as an instance of MapReduce is that there's no complex distribution of workloads involved. It's all happening on a single box... albeit on all the CPUs the box provides.

For an in-depth treatment of the reduce function -- also known as fold -- see Graham Hutton's A tutorial on the universality and expressiveness of fold. It's Haskell-based, but should be readable even if you don't know the language, as long as you're willing to look up a Haskell thing or two as you go (things like ++ meaning list concatenation; no deep Haskell magic).
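
If it helps, this is all fold really is; a hand-rolled Python version (a left fold rather than the paper's foldr, but the idea carries over):

def fold(f, acc, xs):
    # repeatedly combine the accumulator with the next element
    for x in xs:
        acc = f(acc, x)
    return acc

# both summing and mapping are expressible as folds:
total = fold(lambda a, x: a + x, 0, [1, 2, 3])           # 6
doubled = fold(lambda a, x: a + [x * 2], [], [1, 2, 3])  # [2, 4, 6]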

Michał Marczyk
Regarding "there are no constraints at all on the type of said value": you make it sound as if MapReduce requires a particular data structure. It does not.
Max Shawabkeh
Well, the MapReduce paper does describe the Map step as producing a list of key/value pairs. This doesn't require a particular data structure -- like a linked list or hash table -- but it does seem to require a particular structure of data, namely a mapping between keys and values. That's why I use the latter expression in the answer. That being said, I suppose there's nothing to prevent a MapReduce-like operation from proceeding on data of a different structure when appropriate... though I've no idea whether that would fall under the wording of the patent (a *boo!* for software patents!).
Michał Marczyk
Also, there is of course no reason why the values in the key/value pairs would need to be of any particular type... I certainly never meant to imply that.
Michał Marczyk
In practice, MapReduce is very rarely used on strings. It was originally made for protocol buffers (http://code.google.com/p/protobuf/), which describe and contain arbitrary data structures. The key/value pair is simply a step that allows a single `Map()` to feed many `Reduce()`s in a single run, since there's a Reducer for each key.
Max Shawabkeh
Well, that's true; however, it doesn't change the fact that MapReduce as presented in both the original paper and the patent claim (the paper for sure; the claim I can hardly read for its legalese, so that's just as far as I can tell) does involve the key/value business. The fact that this is due to engineering concerns over how to distribute things reinforces the point about MapReduce going beyond the FP map+reduce. From an abstract point of view, however, dividing the data into blocks based on the keys is just part of the reduce step. So I'd still say MapReduce is a specialisation of m+r.
Michał Marczyk
+1  A: 

MapReduce is a framework built around splitting a computation into parallelizable mappers and reducers. It builds on the familiar idiom of map and reduce: if you can structure your task so that it can be performed by independent mappers and reducers, you can write it in a way that takes advantage of a MapReduce framework.

Imagine a Python interpreter that recognized tasks which could be computed independently and farmed them out to mapper or reducer nodes. If you wrote

from functools import reduce  # needed on Python 3, where reduce is no longer a builtin
reduce(lambda x, y: x + y, map(int, ['1', '2', '3']))

or

sum([int(x) for x in ['1', '2', '3']])

you would be using functional map and reduce methods in a MapReduce framework. With current MapReduce frameworks, there's a lot more plumbing involved, but it's the same concept.
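
As a rough sketch of what that plumbing can look like with nothing but the standard library (the process pool stands in for the mapper nodes; local parallelism only):

from functools import reduce
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool() as pool:
        parts = pool.map(int, ['1', '2', '3'])  # map step, spread across worker processes
    print(reduce(lambda x, y: x + y, parts))    # reduce step -> 6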

Karl Anderson