ansaurus

Question

Hadoop searching words from one file in another file

Answer 1

A:

You'll want to do this in two stages, in my opinion. Run the wordcount program (included in the Hadoop examples jar) against each of the two initial documents. This will give you two files, each containing a unique list (with counts) of the words in that document. From there, rather than using Hadoop, do a simple diff on the two files, which should answer your question.
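For what it's worth, a plain diff would also flag words whose counts differ, so the comparison step could be sketched as below. This is only a rough illustration in plain Java (no Hadoop classes). It assumes each wordcount output line looks like word<TAB>count and that both outputs have been copied to the local file system; the class name and the file paths passed on the command line are made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class WordListDiff {
    // Keeps only the word (the text before the first tab) of each "word<TAB>count" line.
    static Set<String> wordsOf(List<String> wordcountLines) {
        Set<String> words = new TreeSet<>();
        for (String line : wordcountLines) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            int tab = line.indexOf('\t');
            words.add(tab < 0 ? line : line.substring(0, tab));
        }
        return words;
    }

    // Words that occur in 'wanted' but not in 'searched'.
    static Set<String> missingFrom(Set<String> wanted, Set<String> searched) {
        Set<String> missing = new TreeSet<>(wanted);
        missing.removeAll(searched);
        return missing;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: WordListDiff <wordcount-output-1> <wordcount-output-2>");
            return;
        }
        // Hypothetical local copies of the two wordcount outputs.
        Set<String> first = wordsOf(Files.readAllLines(Paths.get(args[0])));
        Set<String> second = wordsOf(Files.readAllLines(Paths.get(args[1])));
        for (String word : missingFrom(first, second)) {
            System.out.println(word);
        }
    }
}
```

Swapping the two arguments gives the words of the second document that are missing from the first.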

dangerstat 2010-01-24 18:39:57

Answer 2

A:

Are you using Hadoop/MapReduce for a specific reason to solve this problem? This sounds like something more suited to a Lucene-based application than Hadoop.

If you have to use Hadoop, I have a few suggestions:

Your 'documents' will need to be in a format that MapReduce can deal with. The easiest format to use would be a CSV-based file with each word in the document on its own line. Formats such as PDF will not work.
To take a set of words as input to your MapReduce job and compare it against the data that the job processes, you could use the Distributed Cache so that each mapper can build a set of the words you want to find in the input. However, if your list of words to find is large (you mention 200 MB), I doubt this would work. This method is nevertheless one of the main ways to do a join in MapReduce.
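To make the second suggestion a bit more concrete, here is a minimal sketch of the lookup-set idea in plain Java. It deliberately uses no Hadoop classes: the constructor stands in for the mapper's setup step (where a real job would read the word list from a file shipped through the Distributed Cache), and map stands in for the per-record map call. The class name, the whitespace tokenising, and the sample words are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CachedWordLookup {
    private final Set<String> wanted = new HashSet<>();

    // Stands in for the mapper's setup step: load the word list once into memory.
    CachedWordLookup(Iterable<String> wordList) {
        for (String word : wordList) {
            String trimmed = word.trim();
            if (!trimmed.isEmpty()) {
                wanted.add(trimmed);
            }
        }
    }

    // Stands in for the map call: return the words of one input line that are in the list.
    List<String> map(String line) {
        List<String> hits = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (wanted.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        CachedWordLookup lookup = new CachedWordLookup(Arrays.asList("hadoop", "lucene"));
        System.out.println(lookup.map("use hadoop or lucene here")); // prints [hadoop, lucene]
    }
}
```

The whole word list has to fit in each mapper's memory, which is why a very large list is a problem for this approach.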

The indexing method mentioned in another answer here also offers possibilities. Again though, the term 'indexing a document' just makes me think of Lucene and not Hadoop. If you did use this method, you would need to make sure the key contains a document identifier as well as the word, so that you get the word counts for each individual document.

I don't think I've ever produced multiple output files from a MapReduce job. You would need to write some (very simple) code to process the indexed output into multiple files.

Binary Nerd 2010-01-24 23:06:42

Answer 3

+4 A:

How I would do it:

split the value in 'map' into words and emit (<word>, <source>) (*1)
in 'reduce' you'll get (<word>, <list of sources>)
check the source list (it might be long for both/all sources)
if NOT all sources are in the list, emit (<missingsource>, <word>) for each missing source
job2: job.setNumReduceTasks(<numberofsources>)
job2: in 'map', emit (<missingsource>, <word>)
job2: in 'reduce', for each <missingsource> emit all (null, <word>)

You'll end up with as many reduce outputs as there are distinct <missingsource> values, each containing the words missing from that document. You could write out the <missingsource> ONCE at the beginning of 'reduce' to mark the files.
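The data flow of the steps above can be tried out in memory before writing the actual jobs. The following is only a sketch in plain Java, not Hadoop code: the first loop plays the role of 'map' plus the shuffle (word to set of sources), the second loop plays the role of the first 'reduce' plus the grouping done by job2 (missing source to words). It assumes the full set of sources is known up front; the class name and the sample documents are invented for the example.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class MissingWordsSketch {
    // docs maps a source name to its text; the result maps each source to the words it lacks.
    static Map<String, Set<String>> missingWords(Map<String, String> docs) {
        // 'map' + shuffle: collect, for every word, the sources it occurs in.
        Map<String, Set<String>> sourcesByWord = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    sourcesByWord.computeIfAbsent(word, k -> new HashSet<>()).add(doc.getKey());
                }
            }
        }
        // 'reduce' + job2: for every source not in a word's list, record (missingsource, word).
        Map<String, Set<String>> missingBySource = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : sourcesByWord.entrySet()) {
            for (String source : docs.keySet()) {
                if (!entry.getValue().contains(source)) {
                    missingBySource.computeIfAbsent(source, k -> new TreeSet<>()).add(entry.getKey());
                }
            }
        }
        return missingBySource;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("a.txt", "red green blue");
        docs.put("b.txt", "green blue yellow");
        System.out.println(missingWords(docs)); // prints {a.txt=[yellow], b.txt=[red]}
    }
}
```

Collecting the sources into a set first also takes care of the "might be long" list: a source that appears many times for a common word is only checked once.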

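(*1) How to find out the source in map (Hadoop 0.20):

// Imports needed at the top of the file:
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Inside your Mapper subclass:
private String localname;              // path of the file this mapper is reading
private Text outkey = new Text();      // reused for every emitted word
private Text outvalue = new Text();    // reused for every emitted source
...
// Called once per map task, before any call to map().
public void setup(Context context) throws InterruptedException, IOException {
    super.setup(context);

    // The input split tells us which file this task processes.
    localname = ((FileSplit)context.getInputSplit()).getPath().toString();
}

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
...
    outkey.set(...);                   // the word
    outvalue.set(localname);           // the source it came from
    context.write(outkey, outvalue);
}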

Leonidas 2010-01-25 09:43:24

Awesome..thank you very much.

Algorist 2010-01-26 02:06:41
