tags:
views: 262
answers: 4

I'm coding a query engine to search through a very large sorted index file. My plan is to use a binary search together with Levenshtein-distance word comparison to find a match. Is there a better or faster way than this? Thanks.
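For reference, a rough sketch of that plan in Python, assuming a hypothetical fixed-width record format (a real index file would need its own record framing):

```python
# Sketch of the plan in the question: binary search over a sorted
# on-disk index, then Levenshtein distance for fuzzy matching.
# RECORD_LEN and the null-padded record layout are assumptions.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

RECORD_LEN = 32  # hypothetical fixed record width, in bytes

def binary_search_file(f, target, num_records):
    """Binary search a sorted file of fixed-width, null-padded records."""
    lo, hi = 0, num_records - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD_LEN)
        word = f.read(RECORD_LEN).rstrip(b"\x00").decode("utf-8")
        if word == target:
            return mid
        elif word < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return lo  # nearest insertion point for fuzzy follow-up
```

The seek-per-probe cost is what the answers below are reacting to: each probe is a random disk access.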

+1  A: 

You may want to look into tries; in many cases they are faster than binary search.
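A minimal in-memory sketch of the idea (this `Trie` class is illustrative, not a library API; lookup cost is O(word length), independent of how many words are stored):

```python
# Minimal trie sketch: each node holds a dict of child characters.
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word
```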

uniquesnowflake8
The problem is, I'm searching a very large disk file (random access), not a preloaded in-memory tree.
kar
For that, B*-trees are hard to beat.
Javier
+1  A: 

If you were searching for exact words, I'd suggest a big hash table, which would give you results in a single lookup.

Since you're looking at similar words, maybe you can group the words into many files by something like their soundex, giving you much shorter lists of words to compute the distances to. http://en.wikipedia.org/wiki/Soundex
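A hedged sketch of that bucketing idea, using a simplified American Soundex (the three-name word list is just example data; in practice each bucket would map to its own file on disk):

```python
from collections import defaultdict

def soundex(word):
    """Simplified American Soundex: first letter + up to three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    out, prev = [], codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue                 # h/w do not separate equal codes
        code = codes.get(ch)
        if code is None:             # vowels reset the previous code
            prev = ""
            continue
        if code != prev:
            out.append(code)
        prev = code
    return (word[0].upper() + "".join(out) + "000")[:4]

# Group words into Soundex buckets; edit distance then only needs to
# run inside the (much shorter) bucket matching the query's code.
buckets = defaultdict(list)
for w in ["Robert", "Rupert", "Rubin"]:
    buckets[soundex(w)].append(w)
```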

David
How do you implement a big hash table with a large index? Load it into memory? I'm talking about a 50GB+ index file here.
kar
50GB for an index??? That sounds way out of proportion for any human language.
Javier
It's a collection of research documents (multilanguage, UTF-8); the index's dictionary file (word length + word + offset to the document's content) alone is over 50GB.
kar
+1  A: 

In your shoes, I would not reinvent the wheel - rather I'd reach for the appropriate version of the Berkeley DB (now owned by Oracle, but still open-source just as it was back when it was owned and developed by the UC at Berkeley, and later when it was owned and developed by Sleepycat;-).

The native interfaces are C and Java (I haven't actually tried the latter), but the Python interface is also pretty good (actually better now that it's not in Python's standard library any more, as it can better keep pace with upstream development;-). C++ is of course not a problem either -- I'm pretty sure you can use it from just about any language.

And you get your choice of "BTree" (actually more like a B*Tree) and hash (as well as other approaches that don't help in your case). Benchmark both with realistic data, BTW -- you might be surprised (one way or another) at performance and storage costs.
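The BDB access methods all expose a dict-like key/value interface; as a rough stand-in sketch using only the standard library (`dbm.dumb` here substitutes for the real Berkeley DB BTree/Hash, which needs the `bsddb3` or `berkeleydb` package; the word-to-offset layout is a hypothetical example):

```python
import dbm.dumb   # stdlib stand-in for a Berkeley-DB-style key/value store
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "index")

# Store word -> byte offset into the document file (hypothetical layout).
with dbm.dumb.open(path, "c") as db:
    db[b"word"] = b"12345"

# Later: a single keyed lookup instead of a binary-search scan.
with dbm.dumb.open(path, "r") as db:
    offset = int(db[b"word"])
```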

If you need to throw multiple machines at your indexing problem (because it becomes too large and heavy for a single one), a distributed hash table is a good idea -- the original one was Chord but there are many others now (unfortunately my first-hand experience is currently limited to proprietary ones so I can't really advise you here).

Alex Martelli
memcached is a popular high-performance in-memory distributed hash table, used in many large web sites.
Jim Ferrans
@Jim, yeah, I use memcached (and its GAE implementation, memcache, in particular -- both the original and the GAE one are due to my friend and colleague Brian Fitzpatrick, AKA fitz;-) regularly, but with the 50-GB index size the OP's talking about, memcached's just not gonna help much... how many hundreds of machines do you have just about devoted to serving as memcached shards?!-)
Alex Martelli
+1  A: 

After your comment on David's answer, I'd say that you need two different indexes:

  1. the 'inverted index', where you keep all the words, each with a list of places found
  2. an index into that file, to quickly find any word. This should easily fit in RAM, so it can be a very efficient structure, like a hash table or a red-black tree. I guess the first index isn't updated frequently, so maybe it's possible to build a perfect hash.
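A minimal sketch of that two-index layout (the line-oriented JSON record format is an assumption for illustration, and `io.BytesIO` stands in for the big on-disk file):

```python
import io
import json

def build(postings):
    """Write an inverted-index 'file'; return (file, {word: byte offset})."""
    buf = io.BytesIO()         # stand-in for the on-disk inverted index
    offsets = {}               # small in-RAM index (the second structure)
    for word in sorted(postings):
        offsets[word] = buf.tell()
        line = json.dumps({"word": word, "docs": postings[word]}) + "\n"
        buf.write(line.encode("utf-8"))
    return buf, offsets

def lookup(buf, offsets, word):
    """One RAM lookup, then one seek+read into the big file."""
    off = offsets.get(word)
    if off is None:
        return None
    buf.seek(off)
    return json.loads(buf.readline())["docs"]
```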

Or just use Xapian, Lucene, or any other such library; several of them are widely used and well optimized.

Edit: I don't know much about word-comparison algorithms but I guess most aren't compatible with hashing. In that case, R/B Trees or Tries might be the best way.

Javier
Actually, I had to reinvent the wheel, as the collection is growing very fast each month and we have legacy apps which rely on our custom word-level inverted index format, so my task is to build a new search interface for some new apps. I'm going to benchmark tries (as suggested by uniquesnowflake8) against red-black trees and see which is best. Thanks for all the replies, very much appreciated.
kar