Hi,
I have to search a 25 GB corpus of wikipedia for a single word. I used grep but it takes lot of time. Is there a efficient and easy representation that can be made to search quickly. Also, I want to find exact match.
Thank you.
Hi,
I have to search a 25 GB corpus of wikipedia for a single word. I used grep but it takes lot of time. Is there a efficient and easy representation that can be made to search quickly. Also, I want to find exact match.
Thank you.
You would probably want to do an index of a mapping from word to list of locations (bytecode offsets). The list of words would be sorted alphabetically. You could then have a secondary index of where certain letters start in this large list of words.
Lazy hash | Word index | Corpus
aaa starts at X | aaa | lorem ipsum dolor
aab starts at Y | ... | sit amet .....
aac ... | and 486, 549, 684, ... | ...
... ... | |
zzz ... | |
This is the way advocated by the natural language professor at my department (we did this exercise as a lab in an algorithm course).
I've had success with the Boyer-Moore algorithm and its simplified version. There are implementations for various languages floating around the web.
Have you tried using an indexing engine... say, Lucene with Nutch? Lucene is indexing engine. Nutch is web crawler. Combine the power!
I forgot to mention... CouchDB (http://couchdb.apache.org/)
@aloobe had the answer of using an index file that mapped words to locations. I just want to expound upon this, though I think the answer the OP is looking for may just be Boyer-Moore.
The index file would look like this (simplified to use human-readable 2-digits):
53 17 89 03
77 79 29 39
88 01 05 15
...
Each entry above is the byte offset of a word or letter that you've deemed important enough to index. In practice, you won't use letter indices as then your index file is larger than your corpus!
The trick is, if you were to substitute the words at those locations with the locations, your index file would be an alphabetically-sorted version of the corpus:
and and are as
ate bad bat bay
bear best bin binge
This enables you to do Binary Search on the corpus through the index file. If you are searching for the word "best" above, you would grab the middle entry in the index file, 79. Then you would go to position/byte 79 in the corpus and see what word is there. It is bad
. We know that alphabetically best > bad
, so the position must be in the 2nd half of the index file.
So we grab the middle index between 79 (6th) and 15 (12th), which is 01 in my example. Then we look at position/byte 88 (9th) in the corpus to find bear
. best > bear
so we try again - the middle index now is either 01 (10th) or 05 (11th) depending on how you round. But clearly we'll find best
in 1 or 2 more searches. If we have 12 words like the example, it will take at most 4 searches in the worst case. For a 25GB file with an average word length of say 5 letters and spaces between them, that's ~4 billion words. However, in the worst case scenario you will only search ~32 times. At that point, more of your program's time is spent spinning up the disk and buffering input than actually searching!
This method works with duplicate words as well. If you want to find all of the locations of the word the
, you would binary search on the
until you found the index. Then you would subtract 1 from the position in the index file repeatedly, using the value each time to look into the corpus. If the word at that location is still the
, continue. When you finally stop, you have the earliest index in the index file that maps to the
.
The creation of the index file is the only tough part. You need to go through each word in the corpus, building up a data structure of the words and their indices. Along the way, skip words that are too common or short to be listed, like "a", "I", "the", "and", "is", etc. When you are finished, you can take that data structure and turn it into an index file. For a 25GB file, your indices will need to be > 32 bits, unfortunately, so use a long
(in Java) or long long
(in C) to hold it. There's no reason it should be human readable for you, so write the indices out as 64 bit values, not strings.
The structure I would recommend is a self-balancing binary search tree. Each node is a string value (the word) and index. The tree compares nodes based only on the string, however. If you do this, then in-order traversal (left, node, right) will give you exactly the index file.
Hope this helps! An example I used years ago developing a mobile phone dictionary is Jim Breen's EDICT. It may be difficult to pick up because of the EUC-encoding and Japanese characters, but the intent is the same.