Hi, I am implementing a Hadoop MapReduce job that needs to create its output as multiple S3 objects. Hadoop itself creates only a single output file (an S3 object), but I need to partition the output into multiple files. How do I achieve this?

Any pointers will be much appreciated.

thanks

+1  A: 

I did this by just writing the output directly from my reducer method to S3, using an S3 toolkit. Since I was running on EC2, this was quick and free.

In general, you want Hadoop to handle your input and output as much as possible, for cleaner mappers and reducers; and, of course, you want to write to S3 at the very end of your pipeline, letting Hadoop's code-moving do its job over HDFS.

In any case, I recommend doing all of your data partitioning, and writing entire output sets to S3, in a final reduce task, one set per S3 file. This keeps the writer logic in your code to a minimum. It paid off for me because I ended up with a minimal Hadoop S3 toolkit which I reused across several task flows.
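For illustration, here is a minimal sketch of that pattern, assuming the new-API Hadoop Reducer and the AWS SDK for Java as the S3 toolkit (the answer doesn't name which toolkit was used). The class name, bucket name, and key layout are hypothetical placeholders:

    import java.util.StringJoiner;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    // Hypothetical final-stage reducer: collects all values for a key and
    // uploads them as one S3 object, so each reduce key becomes one file.
    public class S3PartitionReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

        private AmazonS3 s3;

        @Override
        protected void setup(Context context) {
            // Credentials are picked up from the environment / EC2 instance role.
            s3 = AmazonS3ClientBuilder.defaultClient();
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) {
            StringJoiner body = new StringJoiner("\n");
            for (Text value : values) {
                body.add(value.toString());
            }
            // "my-output-bucket" and the "output/" prefix are placeholders;
            // one putObject call per key yields one S3 object per partition.
            s3.putObject("my-output-bucket", "output/" + key.toString(), body.toString());
        }
    }

Note that bypassing Hadoop's output format like this means you lose its output committing (task retries may upload the same object twice), which is usually acceptable since the writes are idempotent per key.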

I needed to write to S3 from my reducer code because Hadoop's S3/S3N filesystem implementations weren't mature at the time; they might work better now.

Karl Anderson
A: 

Do you also know MultipleOutputFormat? It isn't related to S3, but in general it lets you write output to multiple files, according to whatever logic you implement.
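As a sketch of how that looks, here is a hypothetical subclass of MultipleTextOutputFormat (a concrete subclass of MultipleOutputFormat in the old org.apache.hadoop.mapred API) that routes each record to a file named after its key:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Sends each record to an output file derived from its key, so all
    // records sharing a key end up in the same file.
    public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            // "name" is the default leaf name, e.g. part-00000.
            return key.toString() + "/" + name;
        }
    }

You would register it on the old-API JobConf with jobConf.setOutputFormat(KeyBasedOutputFormat.class); with an s3:// or s3n:// output path, each generated file then becomes its own S3 object.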

Peter Wippermann