Hi all, I'm a beginner in Hadoop. I've understood the WordCount program, but now I have a problem: I don't want counts for all the words, only for the words listed in a separate file.

- Words_I_Want.txt -
hello
echo
raj

- Text.txt -
hello eveyone. I want hello and echo count


The output should be:
hello 2
echo 1
raj 0


That was just an example; my actual data is very large.

A: 

In the WordCount example, the Mapper outputs each tokenized word from the input value and the number 1:

while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    output.collect(word, one);
}

If you only want to count certain words, then wouldn't you want to only output words from your Mapper that are matches against your list?

// wordsThatYouCareAbout: e.g. a Set<String> built from Words_I_Want.txt
while (tokenizer.hasMoreTokens()) {
    String token = tokenizer.nextToken();
    if (wordsThatYouCareAbout.contains(token)) {
        word.set(token);
        output.collect(word, one);
    }
}
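
To make the filtering idea concrete, here is a minimal Hadoop-free sketch. The class `FilteredWordCount` and its methods are hypothetical names invented for illustration, and the single `map` method stands in for both the mapper and the summing reducer. The point it tries to show is that the word set is built once and then reused for every line. In a real job the equivalent place to load it would be a once-per-task hook (`configure(JobConf)` in the old `mapred` API, `setup(Context)` in the newer one) rather than inside `map()`.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

// Hadoop-free sketch of map-side filtering; all names here are hypothetical.
class FilteredWordCount {
    // Built once in the constructor, not once per map() call.
    private final Set<String> wordsThatYouCareAbout = new HashSet<>();
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    FilteredWordCount(Iterable<String> wantedWords) {
        for (String w : wantedWords) {
            wordsThatYouCareAbout.add(w);
            // Assumption of this sketch: seed zeros so unseen words still appear.
            counts.put(w, 0);
        }
    }

    // Stands in for map() plus the summing reducer of WordCount.
    void map(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            if (wordsThatYouCareAbout.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    Map<String, Integer> counts() {
        return counts;
    }
}
```

Note that the zero seeding is specific to this sketch. A mapper that only emits matching tokens never emits anything for a word such as `raj` that is absent from the text, so in a real job that word would be missing from the output unless zero counts are added separately.
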
matt b
My data is too big to hold in a `String wordsThatYouCareAbout;`
echo
Sorry for not making it clear but I was assuming that you would use some sort of `Set` or `Collection`, not a string. How large is the data?
matt b
As long as `Words_I_Want.txt` fits in RAM, this will run smoothly.
Wojtek
So in every `map()` call I have to load this set of words from `Words_I_Want.txt` (assuming I can fit it into memory) and check `if (wordsThatYouCareAbout.contains(token))`? This idea looks bad because I'd have to read `Words_I_Want.txt` every time `map()` is called.
echo
A: 

matt b's answer is definitely good for large-to-small joins, but let's assume you're doing a large-to-large join.

You can map Words_I_Want.txt: k: the word, v: some marker

You can then map Text.txt: k: the word, v: 1 (same as the standard word count)

You'll have to use MultipleInputs and figure out which file is which using conf.get("map.input.file").

Then, in the reduce step, collect output only when the key has a marker.
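
As a rough illustration of that flow, here is a Hadoop-free sketch that simulates the two mappers, the shuffle, and the reducer in plain Java. The class `MarkerJoin`, its method names, and the choice of `0` as the marker value are all assumptions made for this example, not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free sketch of the reduce-side join; all names are hypothetical.
class MarkerJoin {
    // Assumed marker value; it cannot clash with a count because counts are always 1.
    static final int MARKER = 0;

    // Simulates both mappers plus the shuffle: values end up grouped by word.
    static Map<String, List<Integer>> mapAndShuffle(List<String> wantedWords,
                                                    List<String> textLines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        // "Mapper" for Words_I_Want.txt: k = the word, v = the marker.
        for (String w : wantedWords) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(MARKER);
        }
        // "Mapper" for Text.txt: k = the word, v = 1, as in the standard word count.
        for (String line : textLines) {
            for (String token : line.split("\\s+")) {
                if (!token.isEmpty()) {
                    grouped.computeIfAbsent(token, k -> new ArrayList<>()).add(1);
                }
            }
        }
        return grouped;
    }

    // Reduce step: emit a summed count only for keys that carry the marker.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            boolean hasMarker = false;
            int sum = 0;
            for (int v : e.getValue()) {
                if (v == MARKER) {
                    hasMarker = true;
                } else {
                    sum += v;
                }
            }
            if (hasMarker) {
                out.put(e.getKey(), sum);
            }
        }
        return out;
    }
}
```

One property of this approach is that a wanted word that never occurs in the text still reaches the reducer through its marker record, so it is emitted with a count of 0, which matches the `raj 0` line in the question.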

Jieren