I'm trying to use Dumbo/Hadoop to calculate TF-IDF for a bunch of small text files, following this example: http://dumbotics.com/2009/05/17/tf-idf-revisited/

To improve efficiency, I've packaged the text files into a sequence file using Stuart Sierra's tool -- http://stuartsierra.com/2008/04/24/a-million-little-files

The sequence file uses my original filenames (324324.txt [the object_id.txt]) as the key and the file contents as the value.

The problem is that each line of output looks like:

[aftershocks, s3://mybucket/input/test-seq-file]        7.606329176204189E-4

What I want is:

[aftershocks, 324324.txt]       7.606329176204189E-4

What am I doing wrong?

I'm running the job with:

dumbo start tfidf.py -hadoop /home/hadoop -input s3://mybucket/input/test-seq-file -output s3://mybucket/output/test3 -param doccount=11 -outputformat text

A: 

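I made the following tweaks to the first mapper and everything started working: I removed the addpath option and yielded the key itself instead of key[0]. With addpath enabled, the first element of the key appears to be the path of the input file, which is why the sequence file's path showed up in the output instead of the original filename.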

#Original version
@opt("addpath", "yes")
def mapper1(key, value):
    for word in value.split():
        yield (key[0], word), 1

# Edited version
def mapper1(key, value):
    for word in value.split():
        yield (key, word), 1
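
To see why the change matters, here is a minimal stand-alone sketch that does not import Dumbo at all. It assumes that with addpath enabled the mapper receives a key shaped like (input_path, original_key), which is consistent with the output shown in the question, and that without it the key is simply the sequence file's key. The function names mapper1_addpath and mapper1_plain and the sample text are made up for illustration.

```python
# Stand-alone illustration; Dumbo is NOT imported and the key shapes are assumptions.

def mapper1_addpath(key, value):
    # Assumed key shape with addpath enabled: (input_path, original_key)
    for word in value.split():
        yield (key[0], word), 1  # key[0] is the input path, not the filename

def mapper1_plain(key, value):
    # Assumed key shape without addpath: the sequence file key itself
    for word in value.split():
        yield (key, word), 1

path = "s3://mybucket/input/test-seq-file"
doc_key = "324324.txt"
text = "aftershocks hit"  # hypothetical file contents

with_path = list(mapper1_addpath((path, doc_key), text))
without_path = list(mapper1_plain(doc_key, text))

print(with_path[0])     # first element of the pair carries the sequence file path
print(without_path[0])  # first element of the pair carries the original filename
```

Under the same assumption, keeping addpath and yielding key[1] instead of key[0] might also work, but I have not tested that.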
erikcw