views:

1625

answers:

5
+21  Q: 

What is Map/Reduce

I hear a lot of noise about map/reduce, esp in the context of Google's massively parallel compute system. What exactly is it, and why is it "cool"?

+8  A: 

MapReduce Explained.

It explains better than what I can. Does it help?

artknish
+24  A: 

From the abstract of Google's MapReduce research publication page:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

The advantage of MapReduce is that the processing can be performed in parallel on multiple processing nodes (multiple servers) so it is a system that can scale very well.

Since it's based from the functional programming model, the map and reduce steps each do not have any side-effects (the state and results from each subsection of a map process does not depend on another), so the data set being mapped and reduced can each be separated over multiple processing nodes.

Joel's Can Your Programming Language Do This? piece discusses how understanding functional programming was essential in Google to come up with MapReduce, which powers its search engine. It's a very good read if you're unfamiliar with functional programming and how it allows scalable code.

See also: Wikipedia: MapReduce

Related question: Please explain mapreduce simply

coobird
Excellently explained. And to Software Monkey, M/R is incredibly easy to implement in just about anything once you understand it and isn't limited to the examples given here. There's several ways to get your head around it, one would be thinking it as collectors and funnels.
Esko
+1  A: 

there are two videos on this topic mapreduce

yesraaj
+4  A: 

Map is a function that applies another function to all the items on a list, to produce another list with all the return values on it. (Another way of saying "apply f to x" is "call f, passing it x". So sometimes it sounds nicer to say "apply" instead of "call".)

This is how map is probably written in C# (it's called Select and is in the standard library):

public static IEnumerable<R> Select<T, R>(this IEnumerable<T> list, Func<T, R> func)
{
    foreach (T item in list)
        yield return func(item);
}

As you're a Java dude, and Joel Spolsky likes to tell GROSSLY UNFAIR LIES about how crappy Java is (actually, he's not lying, it is crappy, but I'm trying to win you over), here's my very rough attempt at a Java version (I have no Java compiler, and I vaguely remember Java version 1.1!):

// represents a function that takes one arg and returns a result
public interface IFunctor
{
    object invoke(object arg);
}

public static object[] map(object[] list, IFunctor func)
{
    object[] returnValues = new object[list.length];

    for (int n = 0; n < list.length; n++)
        returnValues[n] = func.invoke(list[n]);

    return returnValues;
}

I'm sure this can be improved in a million ways. But it's the basic idea.

Reduce is a function that turns all the items on a list into a single value. To do this, it needs to be given another function func that turns two items into a single value. It would work by giving the first two items to func. Then the result of that along with the third item. Then the result of that with the fourth item, and so on until all the items have gone and we're left with one value.

In C# reduce is called Aggregate and is again in the standard library. I'll skip straight to a Java version:

// represents a function that takes two args and returns a result
public interface IBinaryFunctor
{
    object invoke(object arg1, object arg2);
}

public static object reduce(object[] list, IBinaryFunctor func)
{
    if (list.length == 0)
        return null; // or throw something?

    if (list.length == 1)
        return list[0]; // just return the only item

    object returnValue = func.invoke(list[0], list[1]);

    for (int n = 1; n < list.length; n++)
        returnValue = func.invoke(returnValue, list[n]);

    return returnValue;
}

These Java versions need generics adding to them, but I don't know how to do that in Java. But you should be able to pass them anonymous inner classes to provide the functors:

string[] names = getLotsOfNames();

string commaSeparatedNames = (string)reduce(names, 
   new IBinaryFunctor {
       public object invoke(object arg1, object arg2)
           { return ((string)arg1) + ", " + ((string)arg2); }
   }

Hopefully generics would get rid of the casts. The typesafe equivalent in C# is:

string commaSeparatedNames = names.Aggregate((a, b) => a + ", " + b);

Why is this "cool"? Simple ways of breaking up larger calculations into smaller pieces, so they can be put back together in different ways, are always cool. The way Google applies this idea is to parallelization, because both map and reduce can be shared out over several computers.

But the key requirement is NOT that your language can treat functions as values. Any OO language can do that. The actual requirement for parallelization is that the little func functions you pass to map and reduce must not use or update any state. They must return a value that is dependent only on the argument(s) passed to them. Otherwise, the results will be completely screwed up when you try to run the whole thing in parallel.

Daniel Earwicker
Overall a good answer, worth +1; didn't like the jab at Java though - but I have missed function values ever since moving to Java from C, and agree their availability is long overdue in Java.
Software Monkey
Wasn't a serious jab at Java - it has three or so flaws that are enough to make me prefer C# right now, but C# has a list of flaws too that will probably make me prerer another language someday.
Daniel Earwicker
By the way, I'd love it if someone could edit the examples so they use Java generics, if that's actually possible. Or if you can't edit then post snippets here and I'll edit.
Daniel Earwicker
I started to edit, but the map() method creates an array of the return type; Java doesn't allow creating arrays of generic types. I could have changed it to use a list (and possibly convert it to an array), but I ran out of ambition right about then.
Michael Myers
The closure syntax similar to (a, b) => a + ", " + b was something I was really looking forward to in Java 7, especially with some of the new API stuff that looks like it will go in. That syntax would have made stuff like this a lot cleaner; too bad it doesn't look like it will happen.
Adam Jaskiewicz
@mmyers - I think the full equivalent to C# would involve writing a class that implements Iterable<T>, in order to return a state machine object implementing Iterator<T>, i.e. doing everything encapsulated by 'yield return'. The resulting code would be pretty obscure, but it would have lazy eval.
Daniel Earwicker