views:

833

answers:

4

We have a large dataset to analyze with multiple reduce functions.

All reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset costs too much to do every time; it would be better to read it only once and pass the mapped data to multiple reduce functions.

Can I do this with Hadoop? I've searched the examples and the web but I could not find any solutions.

Thanks!

A: 

Of course you can define multiple reducers. For the Job (Hadoop 0.20) just add:

job.setNumReduceTasks(<number>);

But your infrastructure has to support multiple reducers, meaning that you have to:

  1. have more than one CPU available
  2. adjust mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml accordingly

And of course your job has to meet some requirements. Without knowing exactly what you want to do, I can only give broad tips:

  • the map output keys either have to be partitionable by % numReducers, OR you have to define your own partitioner with job.setPartitionerClass(...), for example a random partitioner (see the sketch after this list) ...
  • the data must be reduce-able in the partitioned format ... (references needed?)
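
A rough sketch of such a random partitioner (the class name and the key/value types are my assumptions, not from the question):

import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class RandomPartitioner extends Partitioner<Text, IntWritable> {

    private final Random random = new Random();

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Ignore the key and spread records evenly over all reducers.
        return random.nextInt(numPartitions);
    }
}

// registered in the driver with:
// job.setPartitionerClass(RandomPartitioner.class);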

You'll get multiple output files, one for each reducer. If you want a sorted output, you have to add another job reading all files (multiple map-tasks this time ...) and writing them sorted with only one reducer ...

Have a look too at the Combiner class, which is a local Reducer. It means that you can already aggregate (reduce) in memory over partial data emitted by the map. A very nice example is the WordCount example. The map emits each word as key with a count of 1: (word, 1). The Combiner gets partial data from the map and emits (word, partialCount) locally. The Reducer does exactly the same, but now some (combined) word counts are already > 1. Saves bandwidth.
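
To make that concrete, a minimal WordCount sketch with the new mapreduce API (the class layout is mine; the driver would wire the combiner in with job.setCombinerClass(IntSumReducer.class)):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);               // emit (word, 1)
            }
        }
    }

    // Used both as Combiner and as Reducer: sums the partial counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));   // emit (word, count)
        }
    }

    // In the driver:
    //   job.setCombinerClass(IntSumReducer.class);
    //   job.setReducerClass(IntSumReducer.class);
}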

Leonidas
As far as I can tell, the OP is asking about "having multiple reducer implementations" while you're talking about "multiple instances of the same reducer code", which is something completely different.
Niels Basjes
A: 

Are you expecting every reducer to work on exactly the same mapped data? At the very least the "key" should be different, since it decides which reducer the record goes to.

You can write the output multiple times in the mapper, emitting ($i, $key) as the key (where $i stands for the i-th reducer, and $key is your original key). And you need to add a "Partitioner" to make sure these n records are distributed among the reducers based on $i. Then use a "GroupingComparator" to group records by the original $key.

It's possible to do that, but not in a trivial way in one MR job.
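
A rough sketch of that idea, assuming Text records and an "$i:$key" prefix scheme (the class names, the tag format, and the extractKey helper are all hypothetical):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

public class FanOutJob {

    // Emits every record once per reducer, tagged with the target index $i.
    public static class FanOutMapper extends Mapper<LongWritable, Text, Text, Text> {

        private static final int NUM_REDUCERS = 3;       // assumed fan-out

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String originalKey = extractKey(line);        // hypothetical helper
            for (int i = 0; i < NUM_REDUCERS; i++) {
                context.write(new Text(i + ":" + originalKey), line);
            }
        }

        private String extractKey(Text line) {
            return line.toString().split("\t", 2)[0];     // assumed record layout
        }
    }

    // Routes purely on the $i prefix, so reducer i sees its full copy of the data.
    public static class TagPartitioner extends Partitioner<Text, Text> {

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            int tag = Integer.parseInt(key.toString().split(":", 2)[0]);
            return tag % numPartitions;
        }
    }

    // A GroupingComparator that compares only the $key part would complete the picture.
}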

Victor
But if I add a new key to the output with the `context.write()` method, it will multiply the data transferred from the `Mapper` objects. It only solves the file reading problem, doesn't it?
KARASZI István
Then I'd suggest outputting the mapped data as files, and using other MR jobs to process those files.
Victor
+2  A: 

Maybe a simple solution would be to write a job that doesn't have a reduce function, so all the mapped data is passed directly to the output of the job. You just set the number of reducers to zero for the job.
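
A minimal driver sketch for that map-only pass, assuming the new mapreduce API of Hadoop 0.20 (SharedMapper, the output types, and the paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "shared map pass");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(SharedMapper.class);   // hypothetical shared map function
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setNumReduceTasks(0);                 // zero reducers: map output goes straight to HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}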

Then you would write a job for each different reduce function that works on that data. This would mean storing all the mapped data on the HDFS though.

Another alternative might be to combine all your reduce functions into a single Reducer which writes to multiple files, using a different output for each function. Multiple outputs are mentioned in this article for Hadoop 0.19. I'm pretty sure this feature is broken in the new mapreduce API released with 0.20.1, but you can still use it in the older mapred API.
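
A hedged sketch of how that could look with the old mapred API's MultipleOutputs (the named outputs "sum" and "max", the two aggregation functions, and the record types are assumptions):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class MultiFunctionReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs mos;

    @Override
    public void configure(JobConf conf) {
        mos = new MultipleOutputs(conf);
    }

    @Override
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        int max = Integer.MIN_VALUE;
        while (values.hasNext()) {
            int v = values.next().get();
            sum += v;
            max = Math.max(max, v);
        }
        // Each "reduce function" writes to its own named output file set.
        mos.getCollector("sum", reporter).collect(key, new IntWritable(sum));
        mos.getCollector("max", reporter).collect(key, new IntWritable(max));
    }

    @Override
    public void close() throws IOException {
        mos.close();
    }

    // In the driver (JobConf conf), each named output has to be declared:
    //   MultipleOutputs.addNamedOutput(conf, "sum", TextOutputFormat.class,
    //                                  Text.class, IntWritable.class);
    //   MultipleOutputs.addNamedOutput(conf, "max", TextOutputFormat.class,
    //                                  Text.class, IntWritable.class);
}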

Binary Nerd
+1: Thank you for making me aware of the MultipleOutputs and MultipleOutputFormat functionality.
Niels Basjes
A: 

I still don't get your problem. You can use the following sequence:

database --> map --> reduce (use cat or none, depending on the requirement), then store the data representation you have extracted. If you are saying that it is small enough to fit in memory, then storing it on disk shouldn't be an issue.

Also, your use of the MapReduce paradigm for the given problem is incorrect. Using a single map function and multiple "different" reduce functions makes no sense; it shows that you are just using map to pass data out to different machines to do different things. You don't require Hadoop or any other special architecture for that.

MapReduce is a paradigm for doing a single process faster by utilizing multiple machines, but doing different things with the same data isn't MapReduce. Also, a single map and multiple reduces don't make any sense. At most you can use map1 -> reduce1 -> map2 (do the work) -> reduce2. The map2 should apply the single function to multiple splits of the data.