How do I control output files name and content of an Hadoop streaming job?

views:

696

answers:

How do I control output files name and content of an Hadoop streaming job?

Is there a way to control the output filenames of an Hadoop Streaming job? Specifically I would like my job's output files content and name to be organized by the ket the reducer outputs - each file would only contain values for one key and its name would be the key.

Update: Just found the answer - Using a Java class that derives from MultipleOutputFormat as the jobs output format allows control of the output file names. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.htmlhttp://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html

I havent seen any samples for this out there... Can anyone point out to an Hadoop Streaming sample that makes use of a custom output format Java class?

In general, Hadoop would have you consider the entire directory to be the output, and not an individual file. There's no way to directly control the filename, whether using Streaming or regular Java jobs.

However, nothing is stopping you from doing this splitting and renaming yourself, after the job has finished. You can $HADOOP dfs -cat path/to/your/output/directory/part-*, and pipe that to a script of yours that splits content up by keys and writes it to new files.

Ilya Haykinson 2009-05-24 16:39:57

+1 A:

Using a Java class that derives from MultipleOutputFormat as the jobs output format allows control of the output file names. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.htmlhttp://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html

When using Hadoop Streaming, since only one JAR is supported you actually have to fork the streaming jar and put your new output format classes in it for streaming jobs to be able to reference it...

Eran Kampf 2009-08-05 02:29:56

I'm still trying to figure this out, if anyone has a more concrete example on how to package a new hadoop-streaming.jar with a custom MultipleOurputFormat.java I would appreciate it.

Luis 2010-05-18 20:21:28

ansaurus

tags:

views:

answers:

How do I control output files name and content of an Hadoop streaming job?

related questions