ansaurus

Question

Answer 1

A:

Of course you can define multiple reducers. For the Job (Hadoop 0.20) just add:

job.setNumReduceTasks(<number>);

But. Your infrastructure has to support the multiple reducers, meaning that you have to

have more than one cpu available
adjust mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml accordingly

And of course your job has to match some specifications. Without knowing what you exactly want to do, I only can give broad tips:

the key_map-output have either to be partitionable by %numreducers OR you have to define your own partitioner: job.setPartitionerClass(...) for example with a random-partitioner ...
the data must be reduce-able in the partitioned format ... (references needed?)

You'll get multiple output files, one for each reducer. If you want a sorted output, you have to add another job reading all files (multiple map-tasks this time ...) and writing them sorted with only one reducer ...

Have a look too at the Combiner-Class, which is the local Reducer. It means that you can aggregate (reduce) already in memory over partial data emitted by map. Very nice example is the WordCount-Example. Map emits each word as key and its count as 1: (word, 1). The Combiner gets partial data from map, emits (, ) locally. The Reducer does exactly the same, but now some (Combined) wordcounts are already >1. Saves bandwith.

Leonidas 2010-02-26 00:00:57

As far as I can tell; OP is asking about "Having multiple reducer implementations" and you're talking about "Multiple instances of the same reducer code". Which is something completely different.

Niels Basjes 2010-08-01 09:03:23

Answer 2

A:

Are you expecting every reducer to work on exactly same mapped data? But at least the "key" should be different since it decides which reducer to go.

You can write an output for multiple times in mapper, and output as key (where $i is for the i-th reducer, and $key is your original key). And you need to add a "Partitioner" to make sure these n records are distributed in reducers, based on $i. Then using "GroupingComparator" to group records by original $key.

It's possible to do that, but not in trivial way in one MR.

Victor 2010-02-26 05:07:50

but if I add a new key to the output with the `context.write()` method it will multiple the data transferring from the `Mapper` objects. it solves only the file reading problem, not?

KARASZI István 2010-02-26 11:03:39

then i'd suggest to output the mapped data as files, and use another MRs to process these files.

Victor 2010-02-27 03:17:51

Answer 3

+2 A:

Maybe a simple solution would be to write a job that doesn't have a reduce function. So you would pass all the mapped data directly to the output of the job. You just set the number of reducers to zero for the job.

Then you would write a job for each different reduce function that works on that data. This would mean storing all the mapped data on the HDFS though.

Another alternative might be to combine all your reduce functions into a single Reducer which outputs to multiple files, using a different output for each different function. Multiple outputs are mentioned in this article for hadoop 0.19. I'm pretty sure that this feature is broken in the new mapreduce API released with 0.20.1, but you can still use it in the older mapred API.

Binary Nerd 2010-02-26 22:33:04

+1: Thank you for making me aware of the MultipleOutputs and MultipleOutputFormat functionality.

Niels Basjes 2010-08-01 09:06:12

Answer 4

A:

I still dont get your problem you can use following sequence:

database-->map-->reduce(use cat or None depending on requirement) then store the data representation you have extracted. if you are saying that it is small enough to fit in memory then storing it on disk shouldnt be an issue.

Also your use of MapReduce paradigm for the given problem is incorrect, using a single map function and multiple "different" reduce function makes no sense, it shows that you are just using map to pass out data to different machines to do different things. you dont require hadoop or any other special architecture for that.

2010-02-28 06:10:31

map reduce is a paradigm for doing a single process faster by utilizing multiple machines, but doing different things using same data isnt map reduce.Also single map and multiple reduce dont make any sense.At most you can do is use map1->reduce1->map2(do the work)->reduce2The map2 should do the single function on multiple splits of the data.

2010-02-28 06:14:35

ansaurus

tags:

views:

answers:

Hadoop one Map and multiple Reduce

related questions