views: 175
answers: 3

Hi,

I am reading about MapReduce and the following thing is confusing me.

Suppose we have a file with 1 million entries (integers) and we want to sort them using MapReduce. The way I understood to go about it is as follows:

Write a mapper function that sorts integers. The framework will divide the input file into multiple chunks and give them to different mappers. Each mapper will sort its chunk of data independently of the others. Once all the mappers are done, we will pass each of their results to the reducer, which will combine them and give me the final output.
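In rough Java-style pseudocode (hypothetical names, not the real Hadoop API), this is roughly what I have in mind:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Hypothetical sketch of my understanding, not real Hadoop code.
    class SortingSketch {

        // Mapper: sorts its own chunk, independently of the other mappers.
        static List<Integer> mapChunk(List<Integer> chunk) {
            List<Integer> sorted = new ArrayList<>(chunk);
            Collections.sort(sorted);
            return sorted;
        }

        // Reducer: combines all the sorted chunks into the final output.
        // This is the step I am asking about -- it all happens in one place.
        static List<Integer> reduceChunks(List<List<Integer>> sortedChunks) {
            List<Integer> result = new ArrayList<>();
            for (List<Integer> chunk : sortedChunks) {
                result.addAll(chunk);
            }
            Collections.sort(result); // naive: re-sorts everything on one machine
            return result;
        }
    }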

My doubt is: if we have only one reducer, how does this leverage the distributed framework if, eventually, we have to combine the results in one place? The problem boils down to merging 1 million entries in one place. Is that so, or am I missing something?

Thanks, Chander

+5  A: 

Check out merge-sort.

It turns out that merging already-sorted lists is much more efficient, in terms of both operations and memory consumption, than sorting the complete list from scratch.

If the reducer gets 4 sorted lists, it only needs to look at the current smallest element of each of the 4 lists and pick that one, repeatedly. If the number of lists is constant, this reduce step is an O(N) operation.
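For example, here is a minimal sketch of that k-way merge using a min-heap, in plain Java (no MapReduce framework involved, just the merge idea):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.PriorityQueue;

    class KWayMerge {

        // Merges already-sorted lists by repeatedly taking the smallest head element.
        // The heap holds one cursor (listIndex, offset) per non-empty input list.
        static List<Integer> merge(List<List<Integer>> sortedLists) {
            PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> Integer.compare(
                    sortedLists.get(a[0]).get(a[1]),
                    sortedLists.get(b[0]).get(b[1])));

            for (int i = 0; i < sortedLists.size(); i++) {
                if (!sortedLists.get(i).isEmpty()) {
                    heap.add(new int[] {i, 0});
                }
            }

            List<Integer> merged = new ArrayList<>();
            while (!heap.isEmpty()) {
                int[] cursor = heap.poll();              // list with the smallest head
                List<Integer> source = sortedLists.get(cursor[0]);
                merged.add(source.get(cursor[1]));
                if (cursor[1] + 1 < source.size()) {     // advance that list's cursor
                    heap.add(new int[] {cursor[0], cursor[1] + 1});
                }
            }
            return merged;
        }
    }

Each element passes through the heap exactly once, so merging M elements from N lists costs O(M log N) comparisons; with a fixed number of lists that is linear in the input size.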

Also, the reducers are typically "distributed" in something like a tree, so that work can be parallelized too.

Peter Tillemans
And the reducer can start producing output as soon as it has received the first result from each mapper, which (in the case of a merge sort) lets it merge while it is still emitting output; that's a huge improvement in both time and memory.
helios
It's only constant if you always use the same number of mappers. Generally speaking, it's O(M log N) to merge M elements across N lists if you use a min-heap, and O(M * N) for the "naive" approach. But yeah, since you would expect M >> N, it's basically linear.
SquareCog
There is also the practical consideration that in the "short" term your resources, i.e. CPU cores and boxes, are constant, and it requires management approval to increase M. Hence M looks like an Aztec pyramid with several 'constant' steps.
Peter Tillemans
The number of mappers (and so, N) is limited not by the available cores, but by the size of the data -- in Hadoop, at least. You just wind up having multiple "waves" of mappers if you have more tasks than map slots.
SquareCog
+1  A: 

I think combining multiple sorted lists is more efficient than combining multiple unsorted lists. So the mappers do the work of sorting the chunks, and the reducer merges them. Had the mappers not done the sorting, the reducer would have a tough time doing the full sort itself.

Gopi
+2  A: 

As others have mentioned, merging is much simpler than sorting, so there's a big win there.

However, doing an O(N) serial operation on a giant dataset can be prohibitive, too. As you correctly point out, it's better to find a way to do the merge in parallel, as well.

One way to do this is to replace the partitioning function: swap the random partitioner (which is what's normally used) for something a bit smarter. What Pig does for this, for example, is sample your dataset to come up with a rough approximation of the distribution of your values, and then assign ranges of values to different reducers. Reducer 0 gets all elements < 1000, reducer 1 gets all elements >= 1000 and < 5000, and so on. Then you can do the merge in parallel, and the end result is sorted because you know which range each numbered reducer task handled.
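As a rough sketch of that idea against Hadoop's Partitioner interface (the cut points are hard-coded here purely for illustration; in practice they would come from sampling the data, as Pig does):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative range partitioner; real cut points would be derived by sampling.
    public class RangePartitioner extends Partitioner<IntWritable, IntWritable> {

        // Boundaries between reducers, chosen to approximate the value distribution.
        private static final int[] CUT_POINTS = {1000, 5000};

        @Override
        public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
            // Reducer 0 gets keys < 1000, reducer 1 gets 1000 <= key < 5000, and so on.
            int k = key.get();
            int partition = 0;
            while (partition < CUT_POINTS.length && k >= CUT_POINTS[partition]) {
                partition++;
            }
            return Math.min(partition, numPartitions - 1);
        }
    }

Because each reducer then covers a disjoint, ordered key range, concatenating the reducer outputs in task order gives a globally sorted result.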

SquareCog