I have a large set of text files in an S3 directory. For each text file, I want to apply a function (an executable loaded through bootstrapping) and then write the results to another text file with the same name in an output directory in S3. So there's no obvious reducer step in my MapReduce job.

I have tried using NONE as my reducer, but the output directory fills with files like part-00000, part-00001, etc. There are more of these than there are files in my input directory; each part- file represents only a processed fragment.

Any advice is appreciated.

A: 

From what I've read about Hadoop, it seems that you need a reducer, even one that doesn't change the mappers' output, just to merge the mappers' outputs.

Dan D
A: 

Hadoop provides a reducer called the Identity Reducer.

The Identity Reducer literally just outputs whatever it took in (it is the identity relation). This is what you want to do, and if you don't specify a reducer the Hadoop system will automatically use this reducer for your jobs. The same is true for Hadoop streaming. It exists for exactly the case you describe.
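To make "outputs whatever it took in" concrete, here is a rough sketch of what an identity reducer amounts to if you wrote it yourself as a streaming script. This is only an illustration, not Hadoop's own implementation; the function name `identity_reduce` and the sample records are made up for the example.

```python
import io


def identity_reduce(in_stream, out_stream):
    """Copy every record from in_stream to out_stream unchanged."""
    for line in in_stream:
        out_stream.write(line)


# Demo with in-memory streams (hypothetical tab-separated records).
# In a streaming job you would pass sys.stdin and sys.stdout instead.
sample = io.StringIO("a.txt\tfirst line\nb.txt\tsecond line\n")
result = io.StringIO()
identity_reduce(sample, result)
print(result.getvalue(), end="")
```

Note that this passes records through but does nothing about how the output files are named, so you would still get part- files.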

I've never run a job that doesn't output the files as part-####. I did some research and found that you can do what you want by subclassing the OutputFormat class. You can see what I found here: http://wiki.apache.org/hadoop/FAQ#A27. Sorry I don't have an example.
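One piece I can sketch: an output format that names files per input would need each record to carry its input file name, for example as the key. Below is a minimal streaming mapper helper that tags records that way. It assumes the job exposes the current input path in the `mapreduce_map_input_file` or `map_input_file` environment variable (streaming exports job settings with dots turned into underscores); the function names and the sample S3 path are hypothetical, and the custom output format itself is not shown.

```python
import os
import posixpath


def current_input_name(environ=os.environ):
    """Return the base name of the file this map task is reading, or ''."""
    path = environ.get("mapreduce_map_input_file") or environ.get("map_input_file", "")
    return posixpath.basename(path)


def tag_records(lines, file_name):
    """Yield 'file_name<TAB>line' records so the key carries the source file."""
    for line in lines:
        stripped = line.rstrip("\n")
        yield f"{file_name}\t{stripped}\n"


# Demo with a fake environment (the bucket and path are made up).
name = current_input_name({"map_input_file": "s3://bucket/input/a.txt"})
for record in tag_records(["first line\n", "second line\n"], name):
    print(record, end="")
```

In the real mapper you would run your executable's output through `tag_records` and write it to stdout; the output format subclass would then pick the file name from the key.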

To cite my sources, I learned most of this from Tom White's book: http://www.hadoopbook.com/.

gnucom