I am trying to output the results of my reducer to multiple files. The main data results all go into one file, and the rest of the results are split by category into their respective files. I know that in 0.18 you can do this with MultipleOutputs, and it has not been removed. However, I am trying to make my application 0.20+ compliant. The existing MultipleOutputs functionality still requires JobConf, whereas my application uses Job and Configuration. How can I generate multiple outputs based on the key?
A:
Support for MultipleOutputs isn't in the new API in 0.20. You will need to use the older API.
It has been added in 0.21, which is currently unreleased, as org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.
This thread on the mailing list talks about this problem.
Binary Nerd
2010-02-01 23:41:55
That is incredibly frustrating and stupid. That seems like a fundamental thing that is needed in the program.
monksy
2010-02-02 04:37:08
Yeah. A lot of work is going into getting the API right for 1.0.
Steve
2010-02-02 16:45:07
A:
You can do this in Hadoop 0.20; as mentioned, you just have to use the older API.
There's some very rough code to do so in http://github.com/orngejaket/Info_Moist_1_Splicer/tree/master/src/contrib/streaming/src/java/org/infochimps/hadoop/mapred/lib/
The resulting jar writes each record to a file named after its (sanitized) key.
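To make the "file named after its (sanitized) key" idea concrete, here is a minimal plain-Java sketch with no Hadoop dependency. It is not the code from the repository above: the class name KeyedFileWriter, the sanitizing rule (anything outside letters, digits, '_' and '-' becomes '_') and the "_empty" fallback are all assumptions made for illustration. In the old API, the same name-mapping logic would typically live in a subclass of org.apache.hadoop.mapred.lib.MultipleTextOutputFormat that overrides generateFileNameForKeyValue.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration; not part of Hadoop or the linked repository.
class KeyedFileWriter {

    // Assumed rule: replace every character outside [A-Za-z0-9_-] with '_' so the
    // key is safe as a single file name. Empty keys map to "_empty".
    static String sanitize(String key) {
        String cleaned = key.replaceAll("[^A-Za-z0-9_-]", "_");
        return cleaned.isEmpty() ? "_empty" : cleaned;
    }

    // Append one record to <outputDir>/<sanitized key>, creating the file if needed.
    static Path write(Path outputDir, String key, String value) throws IOException {
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(sanitize(key));
        Files.write(target,
                (value + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("keyed-output");
        write(dir, "fruit/apple", "1");
        write(dir, "fruit/apple", "2");
        write(dir, "veg", "3");
        // Both "fruit/apple" records end up in the same file, fruit_apple.
        System.out.println(Files.readAllLines(dir.resolve("fruit_apple")));
    }
}
```

Note that with a rule like this, two distinct keys can collapse to the same file name (for example "a/b" and "a b"), so a real job may want a stricter or reversible encoding.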
mrflip
2010-02-03 01:06:27