views: 241
answers: 7

Suppose I'm given a large dictionary in a flat file with 200 million words, and my function needs to check the existence of any given word in the dictionary. What's the fastest way to do it? You can't store the dictionary in memory because you only have 1 GB of memory. You can store it in a database, but querying it would still be very, very slow without any optimization. You can't index the full words because you don't have enough resources.

Edit: in addition to the file-optimization approaches mentioned below, are there any database optimizations? I'm thinking of creating partial indices: say, for every 2 letters in the word, up to a limit, I create an index. Would this speed up the DB query?

A: 

Use the Boyer–Moore string search algorithm?

http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%5Fstring%5Fsearch%5Falgorithm

BenB
A: 

If you have no index, just use a stream.

Sometimes the simplest solution is the best.

    public Int64 IndexOf(FileStream file, Byte[] ascii_bytes, Int64 start_index)
    {
        // Naive linear scan for the byte sequence; O(file length) per lookup.
        // Int64 offsets, since a 200-million-word file can exceed 2 GB.
        file.Position = start_index;
        Int32 current = file.ReadByte();

        while (current >= 0)
        {
            if ((Byte)current == ascii_bytes[0])
            {
                Int64 original_index = file.Position - 1;
                Boolean found = true;

                for (Int32 i = 1; i < ascii_bytes.Length; i++)
                {
                    current = file.ReadByte();
                    if (current < 0 || (Byte)current != ascii_bytes[i])
                    {
                        found = false;
                        break;
                    }
                }

                if (found)
                {
                    file.Position = original_index;
                    return original_index;
                }

                // Mismatch: resume scanning one byte past the start of this attempt.
                file.Position = original_index + 1;
            }
            current = file.ReadByte();
        }
        return -1;   // not found
    }
ChaosPandion
Hardly 'the fastest' way to do it
johnc
I never claimed it was fast.
ChaosPandion
+1 for the stream. -1 for the brute force search. Net result 0 Vote.
JohnFx
LOL, I don't need any of those fancy schmancy algorithms!
ChaosPandion
@Chaos He requested fast in the question
johnc
+11  A: 

Binary search

Assuming the dictionary has the words in alphabetical order, I would attempt a modified binary search. Divide and conquer the file by jumping to a midpoint location in the file and seeing what word is there. If the word you landed on is too high alphabetically, recurse on the lower half; if too low, recurse on the upper half; repeat until the word is found or there's no file range left to try.

(As outis mentioned in a comment, after jumping to a file location, you'll need to scan backwards and forwards to find the boundaries of the word you jumped to.)

You might be able to optimize this by guessing a location chunk right off the bat based on the first letter of the word. For example, if the word begins with "c" start your search around the 3/26th section of the file. Though, in reality, I think this early guess will only make a negligible difference overall.

Other optimizations could include keeping a small subset of an index. For example, keep an index of the first word that starts with each letter of the alphabet, or of the first word that starts with each possible two-letter combination. This would allow you to immediately narrow your search area.

O(log n)
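
A rough sketch of this approach in C# (illustrative only; it assumes one ASCII word per line, '\n' line endings, and a file sorted in ordinal byte order):

    using System;
    using System.IO;
    using System.Text;

    static class SortedWordFile
    {
        // Binary search over byte offsets; each probe reads only one line.
        public static bool Contains(FileStream file, string word)
        {
            long low = 0;              // always the start of a line
            long high = file.Length;   // exclusive upper bound

            while (low < high)
            {
                long mid = low + (high - low) / 2;

                // Scan backwards from mid to the start of the line containing it.
                long lineStart = mid;
                while (lineStart > low)
                {
                    file.Position = lineStart - 1;
                    if (file.ReadByte() == '\n') break;
                    lineStart--;
                }

                // Read the word on that line.
                file.Position = lineStart;
                var sb = new StringBuilder();
                int b;
                while ((b = file.ReadByte()) >= 0 && b != '\n')
                    sb.Append((char)b);
                long nextLineStart = file.Position;

                int cmp = String.CompareOrdinal(word, sb.ToString());
                if (cmp == 0) return true;
                if (cmp < 0) high = lineStart;   // word sorts before this line
                else low = nextLineStart;        // word sorts after this line
            }
            return false;
        }
    }

Each probe touches only a handful of bytes, so memory use is negligible; the cost is on the order of 30 seeks per lookup for a file of this size.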

Ryan McGeary
I like this, but it doesn't deal with the memory limitations presented.
windfinder
Why doesn't it? You could literally only load 100 bytes at a time into memory on each midpoint jump. You don't need to load the entire file into memory.
Ryan McGeary
+1, but missing one thing: an arbitrary file position will most likely be in the middle of a word. Easy to fix by scanning forwards (or backwards) until a word boundary is found.
outis
How do you jump to a given location in a file? Any code sample in PHP or Java?
wo_shi_ni_ba_ba
@wo_shi_ni_ba_ba: It depends on the filesystem API, but it should be a single function or method call. For PHP, there's `fseek`. In Java, you have `SeekableByteChannel.position(long)`.
outis
Rather than scanning forward/backward to find word boundaries, you can build an index of offsets into the dictionary. It might just fit entirely in memory with 32-bit offsets depending on memory needs for the OS and program. You could get away with 28-bit offsets in exchange for slightly more complicated code. Additionally, each entry in the string table could be compressed with a library such as SharpZipLib (with the compression dictionary fixed in your program). This would reduce IO as you read parts of the file from disk.
Eric J.
... Reduce IO at the cost of some CPU.
Eric J.
does fseek(n) walk through any of the data before n, or is it an instantaneous operation?
@Ryan: you can extend the position-guessing optimization. Let `dict` be the dictionary and `word` be the word being searched for. At a given iteration, suppose the first difference between the words at the lower `l` and upper `u` bounds is at position `i`. Then `m = l + (u-l)*abs(word[i]-dict[l][i])/abs(dict[u][i]-dict[l][i])` is the guessed midpoint. Basically, you assume an even distribution. Still O(log(N)), but timing should involve a much smaller constant factor, giving real-world speedup.
outis
@Ryan: would this approach run in under 2 seconds in the real world, given a normal PC with 1 GB of RAM?
outis
By the way, storing the data on a flash drive could be faster than storing it on even a fairly fast hard disk. That's the basic idea behind ReadyBoost... actual IO transfers are slower on a flash drive, but "seek" time is MUCH faster. If your hard disk has a large amount of cache compared to your data set, you are probably better off with the hard disk.
Eric J.
@wo_shi_ni_ba_ba: (RE: 2 s) hard to say, as it depends too much on the speed of the PC's I/O system and the storage device. Whether it's a 5400 RPM or 15,000 RPM hard drive or a CD makes a difference. Whenever a disk needs to seek, its read head needs to be repositioned and the disk needs to rotate until the appropriate sector is under the head; both increase latency. Flash memory won't have seek times.
outis
@wo_shi_ni_ba_ba Let's assume an absolutely horrible hard drive seek time of 50ms (for simplicity of also covering any other compute time). The worst case binary search will be log base 2 of 200,000,000 which is 27.5. 27.5 x 50ms = 1.375 seconds. (Most hard drive seeks times are around 9ms.)
Ryan McGeary
A very simple optimization of this is to use fixed-length words -- sure, it'll take up more drive space, but guessing where to seek becomes a trivial calculation, which would be more than worth it. Use the largest word as the size. If you can use fixed-size words + sorted words, this becomes a fairly easy and efficient binary search problem. If you're talking about multiple languages and Unicode, where you can't necessarily group all the words in one sort -- well, you're talking about a world of hurt.
Vineel Shah
+4  A: 

This is a classic use case for a Bloom filter. A Bloom filter is a probabilistic data structure which is optimized for membership tests ("is X a member of this collection?"), and provides O(1) lookup. In exchange, you introduce an arbitrarily small probability of a false positive -- that is, the filter will say a particular word is present, but it is actually not there. The more memory you use, the smaller you can make this probability. However, the probability of false negatives is zero: the filter will never say that a word is absent if it is actually present.

In your specific case, with 8 billion bits (1 GB) to work with, you can get a false positive rate a little better than 1 in every 1,000,000,000 trials. That's an extremely low false positive rate. If you looked up 200 million random strings, the probability that you'd never hit a single false positive is about 82%.

This does not require the dictionary to be sorted, is highly space-efficient, and does not need a database or other ancillary storage structure. Overall, it's probably a good choice for your needs.
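
For concreteness, here is a minimal Bloom filter sketch in C# (an illustration of the idea, not a tuned implementation; the long[] bit array, the FNV-1a hashing, and the double-hashing scheme are choices made for this example):

    using System;
    using System.Collections.Generic;
    using System.Text;

    sealed class BloomFilter
    {
        private readonly long[] bits;      // packed bit array (can exceed 2^31 bits)
        private readonly long numBits;
        private readonly int numHashes;

        public BloomFilter(long numBits, int numHashes)
        {
            this.numBits = numBits;
            this.numHashes = numHashes;
            this.bits = new long[(numBits + 63) / 64];
        }

        public void Add(string word)
        {
            foreach (long bit in BitPositions(word))
                bits[bit >> 6] |= 1L << (int)(bit & 63);
        }

        // No false negatives; false positives occur with small probability.
        public bool MightContain(string word)
        {
            foreach (long bit in BitPositions(word))
                if ((bits[bit >> 6] & (1L << (int)(bit & 63))) == 0) return false;
            return true;
        }

        // k bit positions derived from two FNV-1a hashes via double hashing.
        private IEnumerable<long> BitPositions(string word)
        {
            byte[] data = Encoding.UTF8.GetBytes(word);
            ulong h1 = Fnv1a(data, 14695981039346656037UL);
            ulong h2 = Fnv1a(data, 0x9E3779B97F4A7C15UL) | 1UL;
            for (int i = 0; i < numHashes; i++)
                yield return (long)((h1 + (ulong)i * h2) % (ulong)numBits);
        }

        private static ulong Fnv1a(byte[] data, ulong seed)
        {
            ulong hash = seed;
            foreach (byte b in data) { hash ^= b; hash *= 1099511628211UL; }
            return hash;
        }
    }

Sizing it with the usual formulas: taking 1 GB as m = 2^33 bits and n = 200 million words, a near-optimal number of hashes is k ≈ (m/n) * ln 2 ≈ 30, and the false positive rate (1 - e^(-kn/m))^k comes out to roughly 10^-9, consistent with the figures above. Building the filter takes one streaming pass over the flat file.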

John Feminella
I really like this as a suggestion, but the question makes it seem as though there isn't room for false positives.
Ryan McGeary
I don't see anything leaning one way or the other on this. But I'll clarify my response to add that it's only suitable if you're okay with a tiny probability of false positives. With 8 billion bits to work with, you can get a false positive rate a little bit better than one in every billion tries. That's an extremely low false positive rate. If you looked up all 200 million words, the probability that you'd never hit a single false positive is about 82%.
John Feminella
A: 

Assumptions:

  1. You're going to search for a word many times in the lifetime of a process (as opposed to starting a process each time you look for a word).
  2. The file is sorted.

You could partially index the data, taking up most of the available memory: store words and their starting position in the file using either a B-tree or a sorted array (the latter is more space-efficient but requires a single contiguous block; also, a B-tree requires you to store the end position of a chunk, while an array does not). Leave enough memory space to load a single chunk of words from the file. Search the index (tree traversal or binary search) for the chunk that would contain the word. Once you've found the specific chunk from the partial index, load the corresponding chunk from the file into memory and perform a binary search on it.
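
A rough C# sketch of the lookup side of this scheme (illustrative; it assumes a sorted-array index, one ASCII word per line with '\n' separators, ordinal sort order, and chunks small enough to buffer whole):

    using System;
    using System.IO;
    using System.Text;

    struct ChunkEntry
    {
        public string FirstWord;   // first word stored in the chunk
        public long Start;         // byte offset of the chunk in the file
    }

    static class PartialIndex
    {
        // index is sorted by FirstWord; chunk i spans [index[i].Start, index[i+1].Start).
        public static bool Contains(FileStream file, ChunkEntry[] index, string word)
        {
            // Binary search the in-memory index for the chunk that could hold the word.
            int lo = 0, hi = index.Length - 1;
            while (lo < hi)
            {
                int mid = (lo + hi + 1) / 2;
                if (String.CompareOrdinal(index[mid].FirstWord, word) <= 0) lo = mid;
                else hi = mid - 1;
            }

            long start = index[lo].Start;
            long end = (lo + 1 < index.Length) ? index[lo + 1].Start : file.Length;

            // Load just this chunk from disk (chunks are assumed small) and search it.
            var buffer = new byte[(int)(end - start)];
            file.Position = start;
            int read = 0;
            while (read < buffer.Length)
                read += file.Read(buffer, read, buffer.Length - read);

            string[] words = Encoding.ASCII.GetString(buffer)
                .Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);
            return Array.BinarySearch(words, word, StringComparer.Ordinal) >= 0;
        }
    }

With one index entry every few kilobytes of file, the index stays in the tens of megabytes, leaving plenty of the 1 GB free to buffer a chunk per lookup.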

If you need additional memory, you can toss some elements from the index. With the array, you can reduce the index to n elements with something like the following C (index_len is the current number of entries):

    struct chunk {
        char *word;    /* first word of the chunk */
        long  start;   /* byte offset of the chunk in the file */
    };

    struct chunk *index;           /* sorted array of index_len entries */
    size_t d = index_len / n;
    for (size_t i = 0; i < n; ++i) {
        index[i] = index[i * d];
    }
    index = realloc(index, sizeof(struct chunk) * n);
    index_len = n;

As the end of chunk i is index[i+1].start, the algorithm is dead simple for an array implementation. For B-tree based indices, you can easily merge leaves with their parents.

outis
+3  A: 

Classically, word-lookup problems can be solved efficiently using a trie. Unfortunately, as you mentioned, you can't store all of the data you need in memory, but that shouldn't stop you from using a trie to reduce the search space. Suppose that instead of storing the entire set of words in the trie, you store only an initial segment (prefix), and your end nodes point to small collections of data that are easily (and quickly) searched in the database.
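
A minimal sketch of such a depth-limited prefix trie in C# (illustrative; the dictionary-based nodes and the file-range payloads are assumptions for this example, not the compact array layout discussed in the comments below):

    using System;
    using System.Collections.Generic;

    sealed class PrefixTrieNode
    {
        public readonly Dictionary<char, PrefixTrieNode> Children = new Dictionary<char, PrefixTrieNode>();
        public long RangeStart = -1;   // byte offset of the first word with this prefix
        public long RangeEnd = -1;     // byte offset just past the last such word
    }

    sealed class PrefixTrie
    {
        public const int PrefixLength = 4;   // only the first 4 characters go into the trie
        private readonly PrefixTrieNode root = new PrefixTrieNode();

        // Record that words with this prefix occupy [start, end) in the sorted file.
        public void AddPrefix(string prefix, long start, long end)
        {
            PrefixTrieNode node = root;
            foreach (char c in prefix)
            {
                if (!node.Children.TryGetValue(c, out PrefixTrieNode child))
                    node.Children[c] = child = new PrefixTrieNode();
                node = child;
            }
            node.RangeStart = start;
            node.RangeEnd = end;
        }

        // Returns the file range to search for the word, or false if no
        // dictionary word starts with the word's prefix.
        public bool TryGetRange(string word, out long start, out long end)
        {
            PrefixTrieNode node = root;
            string prefix = word.Length > PrefixLength ? word.Substring(0, PrefixLength) : word;
            foreach (char c in prefix)
            {
                if (!node.Children.TryGetValue(c, out node)) { start = end = -1; return false; }
            }
            start = node.RangeStart;
            end = node.RangeEnd;
            return start >= 0;
        }
    }

Building it takes one pass over the sorted file, calling AddPrefix once per distinct prefix; at query time TryGetRange narrows the search to one prefix's range, which can then be binary-searched or handed to the database.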

Mark E
+1 I was thinking along the same lines
David Zaslavsky
outis
Suppose my sample space is all letters and all numbers; then 36^4 = 1,679,616. I can build at most 4 to 5 levels of trie, but I guess it still helps reduce the search space a lot.
wo_shi_ni_ba_ba
Sure...but since you're encoding what amounts to the set of all 4-character prefixes you're still *substantially* reducing your set. How efficient this really is depends on the distribution of your word lengths, but in the worst case you can search the first 4 characters in O(4) time and then search the subset of words at that leaf using outis' solution, as he suggests in his comment.
Mark E
@wo_shi_ni_ba_ba: your math's off, but it still works out to 5 levels using 1/2 GiB. Each node takes `n=37*sizeof(void*)` bytes (36 child node pointers; in a leaf, the 36 pointers are interpreted as file ranges [start/length pairs]. With alignment, a node takes up the same space as 37 pointers). There are `t=2 ** 29 / n` total nodes (1/2 GiB divided by the size of a node), giving a depth of roughly `log(t)/log(36)+1` for a complete tree. A node doesn't know which character it starts with; that information is stored implicitly in the node's index in a child pointer array.
outis
A: 

If some words are looked up much more frequently than others, then it may make sense to have an in-memory LRU cache with the database behind it.
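
A minimal sketch of such a cache in C# (illustrative; the capacity and the backing-lookup delegate are placeholders for whatever slow lookup you already have, e.g. a database query or a file search):

    using System;
    using System.Collections.Generic;

    sealed class LruWordCache
    {
        private readonly int capacity;
        private readonly Func<string, bool> backingLookup;   // the slow lookup behind the cache
        private readonly Dictionary<string, LinkedListNode<KeyValuePair<string, bool>>> map
            = new Dictionary<string, LinkedListNode<KeyValuePair<string, bool>>>();
        private readonly LinkedList<KeyValuePair<string, bool>> order
            = new LinkedList<KeyValuePair<string, bool>>();   // most recently used at the front

        public LruWordCache(int capacity, Func<string, bool> backingLookup)
        {
            this.capacity = capacity;
            this.backingLookup = backingLookup;
        }

        public bool Contains(string word)
        {
            if (map.TryGetValue(word, out var node))
            {
                // Cache hit: move to the front to mark as most recently used.
                order.Remove(node);
                order.AddFirst(node);
                return node.Value.Value;
            }

            // Cache miss: ask the slow backing store, then remember the answer.
            bool exists = backingLookup(word);
            map[word] = order.AddFirst(new KeyValuePair<string, bool>(word, exists));

            if (map.Count > capacity)
            {
                // Evict the least recently used entry.
                var last = order.Last;
                order.RemoveLast();
                map.Remove(last.Value.Key);
            }
            return exists;
        }
    }

Repeated queries for popular words then cost a dictionary lookup instead of a disk or database hit.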

David Plumpton