views: 185

answers: 9

I have a few million words that I want to search for in a billion-word corpus. What would be an efficient way to do this?

I am thinking of using a trie, but is there an open source trie implementation available?

Thank you

-- Updated --

Let me add a few more details about what exactly is required.

We have a system where we crawled news sources and extracted the popular words based on their frequency. There can be a million words.

Our data will look something like this:

Word1<TAB>Frequency1
Word2<TAB>Frequency2
(tab delimited)

We also got the most popular words (1 billion) from another source, which also contains data in the above format.

Here is what I would like to get as output.

  • Words common to both sources
  • Words present only in our source but not in the reference source
  • Words present only in the reference source but not in our source

I am able to use comm (the bash command) to get the above information, but only for the words alone. I don't know how to make comm compare against only one column rather than the whole line.

The system should be scalable, and we would like to perform this on a daily basis and compare the results. I would also like to get approximate matches.

So I am thinking of writing a MapReduce job. I am planning to write the map and reduce functions as below, but I have a few questions.

Map
For each word
    output key = word, value = structure{ filename, frequency }
done

Reduce
For each key
    Iterate through all the values and check whether both file1 and file2 are present.
    If yes, write it to the common-words file.
    If only in file1, write it to the file1-only file.
    If only in file2, write it to the file2-only file.
done
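
To make the plan concrete, here is the same idea as a rough Python sketch (not actual Hadoop code; file1/file2 and the record layout are just placeholders):

from collections import defaultdict

# Rough sketch of the planned map and reduce logic, outside of any Hadoop API.
def map_phase(records):
    # records: iterable of (filename, word, frequency) tuples
    for filename, word, frequency in records:
        yield word, (filename, frequency)

def reduce_phase(word, values, out_common, out_file1_only, out_file2_only):
    # values: every (filename, frequency) pair emitted for this word
    by_file = defaultdict(list)
    for filename, frequency in values:
        by_file[filename].append(frequency)
    if "file1" in by_file and "file2" in by_file:
        out_common.write(word + "\n")
    elif "file1" in by_file:
        out_file1_only.write(word + "\n")
    else:
        out_file2_only.write(word + "\n")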

I have two questions. In the MapReduce job I can give a directory containing my two files as input, but I don't know how to get the filename from which I am reading each word. How do I get this information? Also, how can I write to different output files? The reduce phase automatically writes only to the default files named part-xxxxx. How do I write to different output files?

Thanks for reading this.

A: 

If I were doing this in Java, I'd use a HashMap. Wikipedia suggests that a trie is marginally better in some cases, but I'm not sure that you'll see much difference.

John
A billion words is a *lot* of data. I think a trie would be worthwhile here, primarily for the decreased memory usage through combining prefixes.
Anon.
I think the size of the words themselves and the data structure indexing them will fit comfortably in memory. There might be a million distinct words. What's unmanageably large here is the data about all the places where each word is used (which has to be a billion entries in some form).
Jason Orendorff
Actually our misaligned concerns here are probably due to vagueness in the question.
Jason Orendorff
A: 

This looks like a job the Aho-Corasick string search algorithm was designed for. I have never coded it myself, but googling a little should turn up some code.

Rabin-Karp might also work, but I have no idea how it behaves with multiple patterns when they are not all of the same length. Note: the multi-pattern pseudocode in the Wikipedia article appears to be wrong, but it should give you a starting point.
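
As a sketch of what using an existing Aho-Corasick implementation might look like (this assumes the third-party pyahocorasick package, which I have not benchmarked at this scale):

# Sketch using the third-party pyahocorasick package (pip install pyahocorasick).
import ahocorasick

def build_automaton(patterns):
    automaton = ahocorasick.Automaton()
    for word in patterns:
        automaton.add_word(word, word)  # value to return when this word matches
    automaton.make_automaton()          # builds the failure links
    return automaton

def find_matches(automaton, text):
    # iter() yields (end_index, value) for every occurrence of every pattern
    for end_index, word in automaton.iter(text):
        yield end_index - len(word) + 1, word

# Example:
# a = build_automaton(["apple", "banana"])
# list(find_matches(a, "an apple and a banana"))  -> [(3, 'apple'), (15, 'banana')]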

MAK
A: 

In the spirit of quick and dirty:

fgrep --mmap -f query-file corpus-file
Tobu
A: 

The data structure used in text search engines is called an inverted index. And as has already been said, Lucene is a very good open source search engine.
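
A minimal in-memory sketch of the idea, purely to illustrate the structure (this is not how Lucene implements it):

# Minimal inverted index sketch: maps each word to the set of documents containing it.
from collections import defaultdict

def build_inverted_index(documents):
    # documents: dict of doc_id -> text
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Example:
# idx = build_inverted_index({1: "the quick brown fox", 2: "the lazy dog"})
# idx["the"] -> {1, 2}, idx["fox"] -> {1}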

ton4eg
+2  A: 

With MapReduce you shouldn't try to do everything in a single step or job. It looks like you should split this problem up into multiple steps. Since you are generating the data that's stored in HDFS, and you need to know the source, you should probably go for a format something like:

{SOURCE},{WORD},{FREQUENCY}

Remember that you are talking about a distributed file system, so referring to your inputs as file1 and file2 isn't technically correct. Both your reference data and source data will be spread throughout the cluster, with pieces of each located on each node.

Next, starting with your pseudocode example, you will need to create a job which correlates a word to its source and frequency. Your mapper will work just fine, but the reducer will need to link the words to the sources. You will need to create your own Writable object which contains a Map<source, frequency>. This will be written to HDFS as intermediate data that your follow-on filter jobs can work with.
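
As a rough illustration only, here is what the reducer of that first job might look like using Hadoop Streaming in Python instead of a custom Java Writable (the word<TAB>source,frequency line layout is just an assumption, and the script is untested):

#!/usr/bin/env python
# Sketch of the first job's reducer, Hadoop Streaming style (illustrative only).
# Input lines arrive on stdin sorted by key:  word<TAB>source,frequency
# Output lines:                               word<TAB>source1=freq1;source2=freq2
import sys

def emit(word, sources):
    if word is not None:
        pairs = ";".join("%s=%s" % (s, f) for s, f in sorted(sources.items()))
        sys.stdout.write(word + "\t" + pairs + "\n")

current_word = None
sources = {}
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t", 1)
    source, freq = value.split(",", 1)
    if word != current_word:
        emit(current_word, sources)
        current_word, sources = word, {}
    sources[source] = freq
emit(current_word, sources)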

You can then use the output from this step as the input to 3 different MapReduce jobs, where each looks for a different combination of sources. These jobs will be very simple, since the mapper will just pass the data through, but the reducer will check each value for the different combinations of sources.

So if you take this approach you will need 4 MapReduce jobs. You don't need to run each one by hand; you can have a single driver program which runs them sequentially. Alternatively, since the final 3 jobs will be using the same input data, you could start those three at the same time once the first has finished. This will probably depend on the amount of data and intermediate data your cluster is able to manage, and on the number of mappers/reducers each job will require.

Hope this suggestion helps.

Binary Nerd
Your suggestions are very useful. I just have a few clarifications. Do you suggest keeping the source as part of the input data? After the first MapReduce job, will I be running the other 3 MapReduce jobs at the same time on the same input data? If all three jobs read the same data and discard some of it based on the output I require, don't you think that is going to be a little inefficient? Is there a way to improve this? I am thinking the other three MapReduce jobs should do nothing in the map phase and apply the required logic in the reduce phase. Is that true? Thanks once again.
Algorist
Hi, yes I would suggest you have the source of the data as a field, which would also allow a more flexible implementation, letting you add more sources without needing to modify your code. The first job would index the words for you, so the intermediate data it generates would be quite large. You would need to make sure you have sufficient room in your HDFS to accommodate it. Once the filtering jobs have run it can be deleted, though. And yes, the final 3 jobs would only need logic in the reduce; the map would do nothing.
Binary Nerd
Thinking about it, you could also just merge the index step and a single filter step together. You wouldn't need the intermediate data then, and you would have three jobs, each answering a single question. I think my main point is: don't try to do too much in a single job.
Binary Nerd
A: 

I'm not sure about its performance, but Python's nltk was designed to do this sort of thing: to tokenize large text corpora and allow you to make comparisons between them. The book "Natural Language Processing with Python" makes use of this toolkit and has many examples. It is available online for free.
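
For example, a small sketch along these lines (file names are placeholders, and I have not measured it on corpora of this size):

# Small sketch using NLTK; file names are placeholders.
# word_tokenize may require: import nltk; nltk.download('punkt')
from nltk import FreqDist
from nltk.tokenize import word_tokenize

def word_frequencies(path):
    # Fine for the smaller list; a billion-word file would need streaming instead.
    with open(path, "r") as f:
        return FreqDist(word_tokenize(f.read()))

ours = word_frequencies("our_corpus.txt")
reference = word_frequencies("reference_corpus.txt")

common = set(ours) & set(reference)
ours_only = set(ours) - set(reference)
reference_only = set(reference) - set(ours)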

Alison R.
A: 

A tokenizer.c compiled to a.out can tokenize the corpus; then a shell pipeline gives reasonable performance:

./a.out < /live/memory/var/cache/man/whatis | sort | awk '{print $1}' | uniq -c | sort -rn > file.txt
LarsOn
A: 

A desktop PC can do this. The smaller data set will fit in memory, and that's all you need.

In Python:

# Load the words from the small file into one big hash set
small_set = set(line.split()[0] for line in open("small.txt", "r"))

# Open 3 output files.
f1 = open("common.txt", "w")
f2 = open("large_only.txt", "w")
f3 = open("small_only.txt", "w")

# Classify each word in the large file: common, or present only in the large file.
for line in open("large.txt", "r"):
    word = line.split()[0]
    if word in small_set:
        f1.write(line)  # word is in both sets
        small_set.remove(word)
    else:
        f2.write(line)  # word is in large but not small

# Everything left over in small_set wasn't in the large file.
for word in small_set:
    f3.write(word + "\n")

A cluster can do it faster. But you can try this at home.

Jason Orendorff
A: 

Since you can use comm, I think you must have sorted input files.

Here is a program like comm that looks at the first column only, but produces output containing the whole line of input. It only works if the input is sorted!

This is a complete program. All you have to do is put this in a text file and run it from the command line.

#!/usr/bin/env python
#
# comm.py - Compare 2 sorted files line by line, based on the first column.
# Usage:   python comm.py FILE1 FILE2 OUTFILE1 OUTFILE2 OUTFILE12
# OUTFILE1 receives all entries that are only in FILE1, etc.

import sys

def compare(f1, f2, out1, out2, out12):
    def get(f):
        line = f.readline()
        if line == '':
            return None
        first, rest = line.rstrip('\n').split('\t', 1)
        return first, rest, line

    e1 = get(f1)
    e2 = get(f2)
    while e1 and e2:
        if e1[0] == e2[0]:   # common entry
            out12.write(e1[0] + "\t" + e1[1] + "\t" + e2[1] + "\n")
            e1 = get(f1)
            e2 = get(f2)
        elif e1[0] < e2[0]:  # e1 is not in f2
            out1.write(e1[2])
            e1 = get(f1)
        else:                # e2 is not in f1
            out2.write(e2[2])
            e2 = get(f2)
    if e1:
        buf = e1[2]
        while buf:
            out1.write(buf)
            buf = f1.read(8192)
    if e2:
        buf = e2[2]
        while buf:
            out2.write(buf)
            buf = f2.read(8192)

compare(open(sys.argv[1], "r"),
        open(sys.argv[2], "r"),
        open(sys.argv[3], "w"),
        open(sys.argv[4], "w"),
        open(sys.argv[5], "w"))
Jason Orendorff