I'm coding a query engine to search through a very large sorted index file. My plan is to use a binary search scan together with Levenshtein-distance word comparison to find matches. Is there a better or faster way than this? Thanks.
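Here's a minimal sketch of what I have in mind, in Python (the `window` size is just an illustrative parameter; I realize this misses candidates whose leading characters differ):

```python
import bisect

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_lookup(sorted_words, query, max_dist=2, window=50):
    # Binary-search to the query's sorted position, then scan a window
    # of neighbours and keep those within max_dist edits.
    i = bisect.bisect_left(sorted_words, query)
    lo, hi = max(0, i - window), min(len(sorted_words), i + window)
    return [w for w in sorted_words[lo:hi]
            if levenshtein(query, w) <= max_dist]
```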
You may want to look into tries, which in many cases are faster than binary search.
If you were searching for exact words, I'd suggest a big hash table, which would give you results in a single lookup.
Since you're looking for similar words, you could group the words into many files by something like their Soundex code, giving you much shorter lists of words to compute distances against. http://en.wikipedia.org/wiki/Soundex
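For instance, a simplified Soundex bucketing might look like this (an illustrative sketch, not the full standard algorithm; libraries such as jellyfish implement it more faithfully):

```python
from collections import defaultdict

def soundex(word: str) -> str:
    # Simplified Soundex: keep the first letter, map remaining consonants
    # to digits, skip vowels and adjacent duplicates, pad to 4 chars.
    # Assumes a non-empty alphabetic word; unlike standard Soundex,
    # 'h' and 'w' act as separators here.
    table = str.maketrans("bfpvcgjkqsxzdtlmnr", "111122222222334556")
    word = word.lower()
    digits = word.translate(table)
    code = word[0].upper()
    prev = digits[0]
    for d in digits[1:]:
        if d.isdigit() and d != prev:
            code += d
        prev = d
    return (code + "000")[:4]

buckets = defaultdict(list)
for w in word_list:  # word_list: the words from your index file
    buckets[soundex(w)].append(w)
# At query time, compute distances only against buckets[soundex(query)].
```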
In your shoes, I would not reinvent the wheel; rather, I'd reach for the appropriate version of Berkeley DB (now owned by Oracle, but still open source, just as it was back when it was owned and developed by UC Berkeley, and later by Sleepycat ;-).
The native interfaces are C and Java (I haven't actually tried the latter), but the Python interface is also pretty good (better now, in fact, that it's no longer in Python's standard library, as it can keep pace with upstream development), C++ is of course not a problem, and so on; I'm pretty sure you can use it from just about any language.
And you get your choice of access methods: "BTree" (actually more like a B*-tree) and hash (as well as other approaches that don't help in your case). By the way, benchmark both with realistic data; you might be surprised, one way or another, at the performance and storage costs.
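Here's a minimal sketch of what that looks like from Python, assuming the third-party bsddb3 bindings (the file name and postings format are made up for illustration):

```python
import bsddb3

# btopen gives the BTree access method; swap in bsddb3.hashopen to
# benchmark the hash flavour against the same data.
db = bsddb3.btopen("words.db", "c")

# Keys and values are bytes: word -> serialized postings list.
db[b"levenshtein"] = b"doc3:17 doc9:2"

print(db[b"levenshtein"])

# The BTree flavour keeps keys sorted, so its cursor-style methods let
# you start an ordered scan at (or just after) any key.
db.close()
```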
If you need to throw multiple machines at your indexing problem (because it becomes too large and heavy for a single one), a distributed hash table is a good idea; the original one was Chord, but there are many others now (unfortunately my first-hand experience is currently limited to proprietary ones, so I can't really advise you there).
After your comment on David's answer, I'd say that you need two different indexes:
- the 'inverted index', where you keep all the words, each with a list of the places where it occurs
- an index into that file, to quickly find any word. It should easily fit in RAM, so it can be a very efficient structure, like a hash table or a red-black tree (see the sketch below). I'd guess the first index isn't updated frequently, so it may even be possible to use a perfect hash.
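A minimal sketch of that two-index layout, assuming a made-up line-oriented file format (one `word<TAB>postings` line per word):

```python
# Build the in-RAM index: word -> byte offset into the big postings file.
offsets = {}
with open("inverted_index.txt", "rb") as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        offsets[line.split(b"\t", 1)[0].decode()] = pos

def lookup(word: str) -> bytes:
    # One seek into the large file per query.
    with open("inverted_index.txt", "rb") as f:
        f.seek(offsets[word])
        return f.readline().split(b"\t", 1)[1]
```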
Or just use Xapian, Lucene, or another such library; several are widely used and well optimized.
Edit: I don't know much about word-comparison algorithms, but I'd guess most aren't compatible with hashing. In that case, red-black trees or tries might be the best way; a sketch of the trie approach follows.
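The reason tries help with edit distance is that each edge extends one row of the Levenshtein DP table, so words sharing a prefix share the work, and whole subtrees are pruned once every cell in a row exceeds the cutoff. A minimal sketch of that well-known technique:

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.word = None    # set on the node that ends a word

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def search(root, query, max_dist):
    # The initial row is the DP row for the empty prefix.
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        _search(child, ch, query, first_row, max_dist, results)
    return results

def _search(node, ch, query, prev_row, max_dist, results):
    # Extend the parent's DP row by one character of the trie path.
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        row.append(min(row[j - 1] + 1,                          # insertion
                       prev_row[j] + 1,                         # deletion
                       prev_row[j - 1] + (query[j - 1] != ch))) # substitution
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    if min(row) <= max_dist:  # prune hopeless subtrees
        for next_ch, child in node.children.items():
            _search(child, next_ch, query, row, max_dist, results)

# Usage: build once, query repeatedly.
# root = TrieNode(); insert(root, "hello"); search(root, "helo", 1)
```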