I have a large list (over 200,000 entries) of strings that I'd like to compare to a given string. The given string is entered by a user, so it may be slightly incorrect.

What I was hoping to do was create some kind of precomputed hash for each string as it is added to the list. This hash would contain information such as the string's length, the sum of its character values, and so on.
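Roughly, I imagine something like this (a minimal sketch of the idea; the names and threshold are made up):

    # Store a cheap signature next to each string, then only run Levenshtein
    # on candidates whose signature is close to the query's.
    def signature(s):
        return (len(s), sum(ord(c) for c in s))

    def candidates(query, indexed, max_edits=1):
        q_len, q_sum = signature(query)
        for s, (s_len, s_sum) in indexed:
            # A string within k edits of the query can differ in length by at most k.
            # The character sum could be used as a second filter, given a suitable
            # per-edit bound; only the length filter is applied here.
            if abs(s_len - q_len) <= max_edits:
                yield s

    # Build the index once, as strings are added to the list.
    strings = ["apple", "apply", "watermelon", "fig"]
    indexed = [(s, signature(s)) for s in strings]
    print(list(candidates("appel", indexed)))  # ['apple', 'apply']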

My question is, does something like this already exist? Surely there is something that lets me avoid running Levenshtein distance against every string in the list?

Or maybe there's a third option I haven't thought of yet?

A:

Sounds like you want a fuzzy hash of some sort. There are plenty of hashing schemes designed for exactly this; the classic old SOUNDEX algorithm might even work.
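For illustration, here is a minimal, simplified Soundex sketch (everything below is illustrative, not any particular library's API):

    # Simplified Soundex: first letter plus three digits describing the
    # following consonant sounds. Strings that sound alike ("Robert"/"Rupert")
    # tend to share a code, so the code works as a coarse bucket key.
    CODES = {c: d for d, letters in enumerate(
        ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

    def soundex(word):
        word = word.lower()
        first = word[0].upper()
        digits = []
        prev = CODES.get(word[0])
        for c in word[1:]:
            code = CODES.get(c)
            if code is not None and code != prev:
                digits.append(str(code))
            if c not in "hw":   # h and w do not reset the previous code
                prev = code
        return (first + "".join(digits) + "000")[:4]

    print(soundex("Robert"), soundex("Rupert"))  # both R163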

Another thought: if you estimate that the probability of an incorrect entry is low, you might get a direct hit 99.9% of the time, fall back to SOUNDEX to catch perhaps 90% of the remaining cases, and only search the whole list for the final 0.01% of queries.
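A rough sketch of that tiered lookup (assuming the soundex() helper above, plus a plain dynamic-programming Levenshtein):

    from collections import defaultdict

    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (ca != cb))) # substitution
            prev = cur
        return prev[-1]

    def build_index(strings):
        exact = set(strings)
        buckets = defaultdict(list)
        for s in strings:
            buckets[soundex(s)].append(s)
        return exact, buckets

    def lookup(query, exact, buckets, all_strings):
        if query in exact:                                 # tier 1: direct hit
            return query
        pool = buckets.get(soundex(query)) or all_strings  # tier 2, else tier 3: full scan
        return min(pool, key=lambda s: levenshtein(query, s))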

Also worth checking this discussion: http://stackoverflow.com/questions/309479/how-to-find-best-fuzzy-match-for-a-string-in-a-large-string-database

mikera