I'm learning Apache Hadoop and was looking at the WordCount example, org.apache.hadoop.examples.WordCount. I understand the example, but I can see that the variable LongWritable key is never used in

(...)
public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
  String line = value.toString();
  StringTokenizer itr = new StringTokenizer(line);
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    output.collect(word, one);
  }
}
(...)

What is the use of this variable? Could someone give me a simple example of where it would be used? Thanks.

+1  A: 

I could be wrong (I have read map/reduce tutorials but haven't used them in real projects yet), but in general I think it is the identifier of the input entry; for example, the tuple (file name, line number). In this particular case it is presumably the line number, which is of no interest for word counts. It could be used if the idea was to, say, aggregate word counts per line rather than per file (or across multiple files, if the key contained that information).

StaxMan
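The per-line aggregation idea above can be sketched in plain Java (no Hadoop on the classpath; the `PerLineWordCount` class, its `map` method, and the sample input are invented for illustration — in a real job the key would arrive as a `LongWritable` and results would go through `output.collect`):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class PerLineWordCount {
    // Simulates a map() call that actually uses the input key (the byte
    // offset of the line) so counts are aggregated per line, not per file.
    static void map(long key, String value, Map<String, Integer> output) {
        StringTokenizer itr = new StringTokenizer(value);
        while (itr.hasMoreTokens()) {
            // Prefix each word with the line's key so a reducer would
            // group counts per (line, word) rather than per word.
            String composite = key + ":" + itr.nextToken();
            output.merge(composite, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        String[] lines = {"hello world", "hello hadoop"};
        Map<String, Integer> output = new LinkedHashMap<>();
        long offset = 0;
        for (String line : lines) {
            map(offset, line, output);
            offset += line.getBytes().length + 1; // +1 for the '\n'
        }
        System.out.println(output);
        // {0:hello=1, 0:world=1, 12:hello=1, 12:hadoop=1}
    }
}
```

Because the key is folded into the output key, "hello" on line 0 and "hello" on line 12 stay separate, which is exactly what per-line (rather than per-file) counting requires.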
+1  A: 

When the InputFormat is TextInputFormat, the Key is the bytes offset from the beginning of the current input file.

Value is simply the line of text at that offset.

If SequenceFileInputFormat were used, the Key would be whatever was stuffed into the Key position of the record, and the same for the Value.

The bottom line is that the Key/Value types depend on the input format (text, sequence file, etc.).

cwensel
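The byte-offset keys described in the answer above can be illustrated without Hadoop at all (the `TextInputFormatKeys` class, its `keyValuePairs` method, and the file contents are invented for this sketch; it mimics what TextInputFormat would hand to the mapper):

```java
import java.util.ArrayList;
import java.util.List;

public class TextInputFormatKeys {
    // For each line, records the pair (byte offset of line start, line text),
    // which is what TextInputFormat presents as (key, value) to a mapper.
    static List<String> keyValuePairs(String fileContents) {
        List<String> pairs = new ArrayList<>();
        long offset = 0;
        for (String line : fileContents.split("\n")) {
            pairs.add(offset + " -> " + line);
            offset += line.getBytes().length + 1; // +1 for the '\n' delimiter
        }
        return pairs;
    }

    public static void main(String[] args) {
        String contents = "first line\nsecond line\nthird";
        for (String pair : keyValuePairs(contents)) {
            System.out.println(pair);
        }
        // 0 -> first line
        // 11 -> second line
        // 23 -> third
    }
}
```

Note that the keys jump by the byte length of each line plus its newline, not by one per line, which is why they are offsets rather than line numbers.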