I'm trying to use Dumbo/Hadoop to calculate TF-IDF for a bunch of small text files, following this example: http://dumbotics.com/2009/05/17/tf-idf-revisited/
To improve efficiency, I've packaged the text files into a sequence file using Stuart Sierra's tool -- http://stuartsierra.com/2008/04/24/a-million-little-files
The sequence file uses my original filenames (e.g. 324324.txt, which is object_id.txt) as the key and the file contents as the value.
The problem is that each line of output looks like:
[aftershocks, s3://mybucket/input/test-seq-file] 7.606329176204189E-4
What I want is:
[aftershocks, 324324.txt] 7.606329176204189E-4
What am I doing wrong?
I'm running the job with:
dumbo start tfidf.py -hadoop /home/hadoop -input s3://mybucket/input/test-seq-file -output s3://mybucket/output/test3 -param doccount=11 -outputformat text
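
My guess at where the path comes from, as a minimal sketch and not the actual code from the linked post: I'm assuming the first mapper runs with Dumbo's addpath option and therefore receives each key as an (input path, original key) pair. If that is right, keying on key[0] would give the S3 path I'm seeing, and key[1] would give the sequence-file key I want. The function names and the sample text below are hypothetical and only illustrate the difference.

```python
# Hypothetical stand-ins for the first mapper in the linked example.
# Assumption (not verified): with Dumbo's addpath option enabled, each key
# arrives as (input_path, original_key), so key[0] is the input path and
# key[1] is the key stored in the sequence file (e.g. "324324.txt").


def mapper_by_path(key, value):
    """Emit ((input path, word), 1), which matches the output I am seeing."""
    for word in value.split():
        yield (key[0], word), 1


def mapper_by_record_key(key, value):
    """Emit ((sequence-file key, word), 1), which is what I want instead."""
    for word in value.split():
        yield (key[1], word), 1


if __name__ == "__main__":
    # Simulated record: the path is the -input argument, the second element
    # is the original filename used as the sequence-file key.
    sample_key = ("s3://mybucket/input/test-seq-file", "324324.txt")
    sample_text = "aftershocks followed the quake"
    print(list(mapper_by_path(sample_key, sample_text))[0])
    print(list(mapper_by_record_key(sample_key, sample_text))[0])
```

Both functions are plain generators, so they can be checked locally without Hadoop before being dropped into tfidf.py.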