I've been trying to use Hadoop to send N lines at a time to a single map call. The lines don't need to be split beforehand.

I've tried NLineInputFormat, but it takes N lines of text from the data and sends them to each mapper one line at a time [giving up after the Nth line].

I also tried setting the following option, but it only takes N lines of input and sends them one line at a time to each map:

    job.setInt("mapred.line.input.format.linespermap", 10);

I found a mailing list post recommending that I override LineRecordReader::next, but that is not so simple, because the internal data members are all private.

I've just checked the source for NLineInputFormat and it hard-codes LineReader, so overriding will not help.

Also, btw I'm using Hadoop 0.18 for compatibility with the Amazon EC2 MapReduce.

+1  A: 

You have to implement your own input format. That also gives you the chance to define your own record reader.
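To illustrate what such a record reader would need to do, here is a minimal sketch in plain Java with no Hadoop dependency. The class name `MultiLineRecordSketch` and the helper `readAll` are hypothetical, and the `next` method only imitates the fill-the-value-and-return-a-boolean contract of the old `RecordReader.next(key, value)`. It assumes the N lines should be joined with newlines into a single value.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: hands out up to N consecutive lines as one record. */
class MultiLineRecordSketch {
    private final BufferedReader in;
    private final int linesPerRecord;

    MultiLineRecordSketch(BufferedReader in, int linesPerRecord) {
        if (linesPerRecord < 1) {
            throw new IllegalArgumentException("linesPerRecord must be >= 1");
        }
        this.in = in;
        this.linesPerRecord = linesPerRecord;
    }

    /**
     * Fills 'value' with the next group of lines, joined by '\n'.
     * Returns false once the input is exhausted, like RecordReader.next().
     */
    boolean next(StringBuilder value) throws IOException {
        value.setLength(0);
        int count = 0;
        String line;
        // The count check comes first so no extra line is consumed.
        while (count < linesPerRecord && (line = in.readLine()) != null) {
            if (count > 0) {
                value.append('\n');
            }
            value.append(line);
            count++;
        }
        return count > 0;
    }

    /** Convenience helper for trying the sketch on an in-memory string. */
    static List<String> readAll(String text, int linesPerRecord) {
        MultiLineRecordSketch reader = new MultiLineRecordSketch(
                new BufferedReader(new StringReader(text)), linesPerRecord);
        List<String> records = new ArrayList<String>();
        StringBuilder value = new StringBuilder();
        try {
            while (reader.next(value)) {
                records.add(value.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

In a real input format the same loop would sit inside a class implementing `RecordReader<K, V>`, reading from the split's byte range instead of a `StringReader`. The last record may hold fewer than N lines when the input does not divide evenly.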

Unfortunately, you also have to define a getSplits() method. In my opinion this will be harder than implementing the record reader, because this method has to implement the logic that chunks the input data.
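One possible chunking logic is to cut the input at every Nth line boundary. The sketch below is plain Java over an in-memory byte array, not Hadoop code: the class `LineSplitSketch` and its method `computeSplits` are hypothetical names, and it assumes lines end with `\n`. Each result is a `{start, length}` byte range of the kind a getSplits() implementation could wrap in an `InputSplit`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: chunk input into byte ranges of N lines each. */
class LineSplitSketch {

    /**
     * Returns {start, length} pairs, each covering up to linesPerSplit
     * lines. Assumes '\n' line endings; the final range may be shorter.
     */
    static List<long[]> computeSplits(byte[] data, int linesPerSplit) {
        if (linesPerSplit < 1) {
            throw new IllegalArgumentException("linesPerSplit must be >= 1");
        }
        List<long[]> splits = new ArrayList<long[]>();
        long start = 0;       // byte offset where the current split begins
        int linesInSplit = 0; // complete lines seen in the current split
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                linesInSplit++;
                if (linesInSplit == linesPerSplit) {
                    // Close the split just after this newline.
                    splits.add(new long[] {start, i + 1 - start});
                    start = i + 1;
                    linesInSplit = 0;
                }
            }
        }
        // Leftover lines (or a last line without a newline) form the tail.
        if (start < data.length) {
            splits.add(new long[] {start, data.length - start});
        }
        return splits;
    }
}
```

Scanning the whole input on the client to find line boundaries costs one full read before the job starts, which is the trade-off for getting exact N-line splits.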

See the following excerpt from "Hadoop: The Definitive Guide" (a great book I would always recommend!):

Here’s the interface:

    public interface InputFormat<K, V> {
      InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
      RecordReader<K, V> getRecordReader(InputSplit split,
                                         JobConf job,
                                         Reporter reporter) throws IOException;
    }

The JobClient calls the getSplits() method, passing the desired number of map tasks as the numSplits argument. This number is treated as a hint, as InputFormat implementations are free to return a different number of splits to the number specified in numSplits. Having calculated the splits, the client sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers.

On a tasktracker, the map task passes the split to the getRecordReader() method on InputFormat to obtain a RecordReader for that split. A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs, which it passes to the map function. A code snippet (based on the code in MapRunner) illustrates the idea:

    K key = reader.createKey();
    V value = reader.createValue();
    while (reader.next(key, value)) {
      mapper.map(key, value, output, reporter);
    }
Peter Wippermann
That kinda works. But that really doesn't answer the question. There is an issue with adding new InputFormats under 18.3.
monksy
OK, I'm sorry. Indeed there is no real question, since I see no question mark :-P So what, more specifically, do you need to know?
Peter Wippermann