I have a job in Hadoop 0.20 that needs to operate on large files, one at a time. (It's a pre-processing step to get file-oriented data into a cleaner, line-based format more suitable for MapReduce.)

I don't mind how many output files I have, but each Map's output can be in at most one output file, and each output file must be sorted.

  • If I run with numReducers=0, the job runs quickly and each Mapper writes its own output file, which is fine - but the files aren't sorted.
  • If I add a single reducer (the plain Reducer.class), it introduces an unnecessary global sort into one output file, which takes many hours (much longer than the Map tasks themselves).
  • If I add multiple reducers, the results of individual map tasks are mixed together, so one Map's output ends up spread across multiple files.

Is there any way to persuade Hadoop to perform a map-side sort on the output of each map task without using Reducers, or some other way to skip the slow global merge?
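
For reference, a minimal driver for the map-only setup described in the first bullet might look like the sketch below (the identity `Mapper` and the argument paths are placeholders, not the real pre-processing job):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PreprocessDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "preprocess");
        job.setJarByClass(PreprocessDriver.class);
        // Identity Mapper used as a stand-in for the real pre-processing mapper.
        job.setMapperClass(Mapper.class);
        // Map-only job: runs quickly, one output file per map task, but unsorted.
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```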

A: 

See Ben's comment below -- this doesn't work. I'll leave this wrong answer here so that we at least know what doesn't work.

I believe that's what a Combiner would do for you. I have never used them myself, but http://hadoop.apache.org/common/docs/r0.20.1/mapred_tutorial.html states (section Payload / Mapper):

Users can optionally specify a combiner, via JobConf.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

My reading of this is that if you specified an identity reducer as the combiner, then each mapper's output should be sorted.
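
In configuration terms, the suggestion amounts to something like the sketch below, using the old `mapred` API that the quoted tutorial refers to (the class name is illustrative; as the comments below note, it has no effect once the reduce count is zero):

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class CombinerConfigSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(CombinerConfigSketch.class);
        // Identity reducer registered as the combiner, as suggested above.
        conf.setCombinerClass(IdentityReducer.class);
        // With zero reduce tasks the map output bypasses the sort/combine path,
        // so this setting has no effect for a map-only job (see the comments below).
        conf.setNumReduceTasks(0);
    }
}
```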

HD
I do have "job.setCombinerClass(Reducer.class)" in place, but it doesn't seem to take effect when the number of reducers is zero. From Mapper.java: "If the job has zero reduces then the output of the `Mapper` is directly written to the OutputFormat without sorting by keys." So I suppose I'm asking whether there's a way to circumvent this, or to get the same effect by other means.
Ben Moran
Too bad. So, could you not output anything in the mapper's `map` call, but simply stash the values to be collected in memory (using enough mappers to make sure this doesn't get too big)? Then in the `cleanup` call, sort the values yourself and output them there (see the sketch below).
HD
Yes - I think I will have to sort it there myself, though memory per mapper might be a problem... Thanks for the input.
Ben Moran
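
For concreteness, an illustrative sketch of the buffer-and-sort-in-`cleanup` idea from the comments above might look like this; the key/value types and the tab-separated parsing are assumptions for the example, and everything a task reads must fit in its heap:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Buffer everything in map(), then sort and emit in cleanup(), so each map
// task's single output file is sorted without any reduce phase.
public class BufferingSortMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final List<Text[]> buffer = new ArrayList<Text[]>();

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        // Don't emit yet; parse the line and stash a copy in memory.
        String[] parts = line.toString().split("\t", 2);
        Text key = new Text(parts[0]);
        Text value = new Text(parts.length > 1 ? parts[1] : "");
        buffer.add(new Text[] { key, value });
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Sort this task's records by key, then write them all out at once.
        Collections.sort(buffer, new Comparator<Text[]>() {
            public int compare(Text[] a, Text[] b) {
                return a[0].compareTo(b[0]);
            }
        });
        for (Text[] pair : buffer) {
            context.write(pair[0], pair[1]);
        }
    }
}
```
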
+2  A: 

Combiners aren't going to globally sort your data - they are basically a cache that partially aggregates map output before it is sent to the reducers.

Normally you don't want to sort each mapper's output separately, but if you do, why not add the mapper's file id as part of your output key and use a custom partition function, so that each mapper's output is partitioned separately and hence sorted separately, and the output of any given mapper always ends up in a single file? You'd probably also want to group by the file id, so you get the sorted output of each input file separately.
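
As an illustration only (the `<fileId>\t<key>` layout is an assumption, not something the answer prescribes), a partitioner along these lines might be:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the idea above: the mapper emits keys of the form
// "<fileId>\t<originalKey>", and this partitioner sends all records that share
// a file id to the same reduce partition, so a given mapper's records land,
// sorted and contiguous, in a single output file (a file may hold more than one
// mapper's output if there are fewer reducers than input files).
public class FileIdPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String fileId = key.toString().split("\t", 2)[0];
        // Non-negative hash of the file id only, so one file maps to one partition.
        return (fileId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be registered with `job.setPartitionerClass(FileIdPartitioner.class)`, and the "group by the file id" step could then be done with a grouping comparator that compares only the file-id prefix of the key.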

I am curious, why do you want to sort mapper output separately anyhow?

Another thought: Hadoop is actually going to do a map-side sort of your output as part of the shuffle, so you could probably arrange for it not to delete those temporary sorted files if you did run with many reducers.

Ron Bodkin
Ben Moran