ansaurus

Question

Hadoop MapReduce job on file containing HTML tags

Answer 1

+1 A:

You can reproduce the bug even with just:

echo "hello - world" | ./mapper.py | sort | ./reducer.py

The issue is here:

if word == '': continue
for c in string.punctuation:
    word = word.replace(c, '')

If word is a single punctuation mark, as "-" is in the input above once the line is split, the replacement turns it into an empty string, but only after the empty-string check has already run. So the empty word slips through to the output. Just move the check for an empty string to after the replacement.
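A minimal sketch of the reordered loop, for illustration only. It assumes the mapper splits each line on whitespace and emits tab-separated word/count pairs; the original mapper.py is not shown in full, so the function names clean_words and map_lines are hypothetical.

```python
import string


def clean_words(line):
    """Yield the words of a line with punctuation stripped, skipping empties.

    Hypothetical helper: the name and the whitespace split are assumptions,
    not taken from the original mapper.py.
    """
    for word in line.split():
        # Strip punctuation first...
        for c in string.punctuation:
            word = word.replace(c, '')
        # ...then check for an empty string, so a token such as "-"
        # that consisted only of punctuation is dropped.
        if word == '':
            continue
        yield word


def map_lines(lines):
    """Return 'word<TAB>1' records for every cleaned word (assumed output format)."""
    return ['%s\t%s' % (word, 1) for line in lines for word in clean_words(line)]
```

In a streaming mapper, something like `for record in map_lines(sys.stdin): print(record)` would write the records to standard output. With the input "hello - world", the lone "-" no longer produces an empty key.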

codelogic 2009-12-03 21:53:22

Is it safe to assume that if the cat pipeline gives the desired output, the MapReduce step will work too?

rohanbk 2009-12-04 02:44:31

For a more pleasant Python/Hadoop integration experience, consider using Dumbo.

Ranieri 2009-12-22 15:50:27
