I have a job in Hadoop 0.20 that needs to operate on large files, one at a time. (It's a pre-processing step to get file-oriented data into a cleaner, line-based format more suitable for MapReduce.)
I don't mind how many output files I have, but each Map's output can be in at most one output file, and each output file must be sorted.
- If I run with numReducers=0, it runs quickly, and each Mapper writes its own output file, which is what I want - but the files aren't sorted.
- If I add a single reducer (the plain identity Reducer.class), it adds an unnecessary global sort into one file, which takes many hours (far longer than the Map tasks themselves).
- If I add multiple reducers, the output of individual map tasks gets partitioned across them, so one Map's output ends up spread over several files. (A sketch of all three configurations follows this list.)
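
For concreteness, this is roughly how I'm setting up the three cases (a minimal sketch against the new `org.apache.hadoop.mapreduce` API in 0.20; the class and path names are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Preprocess {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "preprocess");
        job.setJarByClass(Preprocess.class);
        // job.setMapperClass(MyPreprocessMapper.class); // the real mapper goes here
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Case 1: map-only -- fast, one file per mapper, but unsorted.
        job.setNumReduceTasks(0);

        // Case 2: one identity reducer -- sorted, but a global sort into a
        // single file that takes hours.
        // job.setNumReduceTasks(1);
        // job.setReducerClass(Reducer.class);

        // Case 3: several identity reducers -- sorted files, but each
        // mapper's output is partitioned across all of them.
        // job.setNumReduceTasks(4);
        // job.setReducerClass(Reducer.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```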
Is there any way to persuade Hadoop to perform a map-side sort on the output of each Map task, without using Reducers, or any other way of skipping the slow global merge?
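
The closest workaround I can see is to sort inside the mapper itself: buffer each record and emit everything in key order from cleanup(), keeping numReducers=0 so each task still writes exactly one file. A hypothetical sketch (it assumes Text keys and values and that one map task's output fits in memory, which may not hold for my files):

```java
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Buffers a whole map task's records and writes them out in key order.
// Note: TreeMap keeps one value per key; duplicate keys would need a
// TreeMap<Text, List<Text>> instead.
public class SortingMapper extends Mapper<Object, Text, Text, Text> {
    private final TreeMap<Text, Text> buffer = new TreeMap<Text, Text>();

    @Override
    protected void map(Object key, Text value, Context context) {
        // Placeholder parsing; the real key/value split depends on the data.
        String[] parts = value.toString().split("\t", 2);
        // Copy into fresh Text objects because Hadoop reuses the ones it
        // passes in.
        buffer.put(new Text(parts[0]),
                   new Text(parts.length > 1 ? parts[1] : ""));
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        for (Map.Entry<Text, Text> e : buffer.entrySet()) {
            context.write(e.getKey(), e.getValue());
        }
    }
}
```

But that clearly won't scale to arbitrarily large input splits, so I'd much rather have something the framework does for me.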