I have a few million words that I want to search for in a corpus of about a billion words. What would be an efficient way to do this?
I am thinking of using a trie (a minimal sketch of what I mean is below); is there an open-source trie implementation available?
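To clarify what I mean, here is a minimal trie sketch (my own toy code, just to illustrate the idea; a real implementation would need to be far more memory-efficient at this scale):

```java
import java.util.HashMap;
import java.util.Map;

// A minimal trie: insert words, then test membership in O(word length).
public class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    public boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.isWord;
    }
}
```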
Thank you
-- Updated --
Let me add a few more details about what exactly is required.
We have a system that crawls news sources and extracts the popular words based on their frequency. There can be around a million words.
Our data looks something like this (tab-delimited):
Word1<TAB>Frequency1
Word2<TAB>Frequency2
We also have the most popular words (about 1 billion) from another source, which contains data in the same format.
Here is what I would like to get as output:
- Words common to both sources
- Words present only in our source but not in the reference source
- Words present only in the reference source but not in our source
I am able to get the above information with comm (the bash command), but only for the words by themselves; I don't know how to make comm compare against just one column rather than both columns.
The system should be scalable, and we would like to run this on a daily basis and compare the results. I would also like to get approximate matches.
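By "approximate matches" I have in mind something like words within a small edit distance of each other; that definition is my own assumption, and the helper below is only a sketch of it:

```java
public class ApproxMatch {
    // Levenshtein edit distance between two words; "approximate match"
    // could then mean distance <= 1 or 2 (my assumption, not a fixed spec).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```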
So I am thinking of writing a MapReduce job. I plan to write the map and reduce functions as below (a rough Java sketch follows them), but I have a few questions.
Map
    for each word:
        emit key = word, value = structure { filename, frequency }
Reduce
    for each key:
        iterate through all the values and check whether both file1 and file2 occur
        if the word occurs in both, write it to the common-words file
        if it occurs only in file1, write it to the file1-only file
        if it occurs only in file2, write it to the file2-only file
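Here is a rough Hadoop (Java) sketch of the plan above. All class names, file names ("ours"/"ref"), and output names are placeholders I made up, and the two commented spots correspond exactly to my questions below:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Map: each input line is "word<TAB>frequency"; emit word -> "filename:frequency".
public class WordOriginMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        if (fields.length != 2) return; // skip malformed lines

        // Question 1 below: I *think* the input split can be cast to
        // FileSplit to recover the source file name, but I am not sure.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

        context.write(new Text(fields[0]), new Text(fileName + ":" + fields[1]));
    }
}

// Reduce: decide which of the two files each word appeared in.
class WordCompareReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text word, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        boolean inOurs = false;
        boolean inRef = false;
        StringBuilder freqs = new StringBuilder();
        for (Text v : values) {
            // "ours..." / "ref..." file names are placeholders of mine.
            if (v.toString().startsWith("ours")) inOurs = true;
            else inRef = true;
            if (freqs.length() > 0) freqs.append(' ');
            freqs.append(v.toString());
        }
        // Question 2 below: MultipleOutputs is my guess for writing to
        // files other than the default part-xxxxx output.
        Text out = new Text(freqs.toString());
        if (inOurs && inRef)  mos.write("common",   word, out);
        else if (inOurs)      mos.write("oursonly", word, out);
        else                  mos.write("refonly",  word, out);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
```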
I have two questions:
1. As input to the job I can give a directory containing my two files, but I don't know how to find out which file a given word was read from. In the sketch above I cast the input split to FileSplit to get the file name; is that the right way to get this information?
2. The reduce phase automatically writes everything to a single default output file named part-xxxxx. How can I write to different output files? In the sketch I used MultipleOutputs, but I am not sure that is the intended approach.
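For completeness, here is the driver setup I believe MultipleOutputs requires (the named outputs apparently have to be declared up front; again, all names here are mine):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCompareDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-compare");
        job.setJarByClass(WordCompareDriver.class);
        job.setMapperClass(WordOriginMapper.class);
        job.setReducerClass(WordCompareReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input is the directory holding both frequency files.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Declare the three named outputs the reducer writes to.
        MultipleOutputs.addNamedOutput(job, "common", TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "oursonly", TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "refonly", TextOutputFormat.class, Text.class, Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```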
Thanks for reading this.