+3 Q:

Comparing string distance based on precomputed hashes

I have a large list (over 200,000) of strings that I'd like to compare to a given string. The given string is inserted by a user, so it may be slightly incorrect.

What I was hoping to do was create some kind of precomputed hash for each string as it is added to the list. This hash would contain information such as the string's length, the sum of its character codes, and so on.

My question is, does something like this already exist? Surely there would be something that lets me avoid running Levenshtein distance on every string in the list?

Or maybe there's a third option I haven't thought of yet?

+1 A:

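Sounds like you want a fuzzy hash of some sort: a function that maps similar strings to the same key, so that a lookup only has to examine the strings sharing the query's key. Several such hash functions exist. The classic "SOUNDEX" algorithm, which encodes a word by how it sounds, might even work.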
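To make the idea concrete, here is a minimal sketch of the common American Soundex rules in Python, plus a helper that groups a word list by code. The names `soundex` and `build_index` are my own for illustration, and real implementations differ in edge cases (non-ASCII letters are simply dropped here), so treat this as a starting point rather than a reference implementation.

```python
from collections import defaultdict

# Consonant groups share a digit; vowels, y, h and w get no digit.
_CODES = {}
for _digit, _letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(word: str) -> str:
    """Return a 4-character Soundex code, or "" if word has no ASCII letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code counts again
    return result.ljust(4, "0")


def build_index(strings):
    """Precompute code -> list of strings once, when the list is loaded."""
    index = defaultdict(list)
    for s in strings:
        index[soundex(s)].append(s)
    return index


index = build_index(["Robert", "Rupert", "Ashcraft"])
candidates = index[soundex("Robbert")]  # only this bucket needs a closer look
```

Note that Soundex always keeps the first letter, so a typo in the first character lands in a different bucket; that is one reason a fallback to a wider search is still useful.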

Another thought: if you estimate that the probability of an incorrect entry is low, a tiered lookup may be enough. Try an exact match first, which might succeed 99.9% of the time; fall back to SOUNDEX, which might catch 90% of the remaining cases; and search the whole list only for the remaining 0.01% of queries.
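A rough sketch of that tiered lookup in Python follows. Everything named here (`TieredIndex`, `crude_key`, `max_distance`) is hypothetical, and `crude_key` is only a stand-in for whatever fuzzy hash you choose; a SOUNDEX function can be passed as `key` instead. The full scan skips any string whose length differs too much from the query, since the length difference is a lower bound on the Levenshtein distance.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def crude_key(s: str) -> str:
    """Placeholder fuzzy key: drop vowels after the first character and
    collapse repeated letters. An assumption for illustration only."""
    s = s.lower()
    out = s[:1]
    for ch in s[1:]:
        if ch in "aeiou" or ch == out[-1]:
            continue
        out += ch
    return out


class TieredIndex:
    def __init__(self, strings, key=crude_key):
        self.strings = list(strings)
        self.exact = set(self.strings)
        self.key = key
        self.buckets = defaultdict(list)
        for s in self.strings:  # precomputed once, when the list is built
            self.buckets[key(s)].append(s)

    def lookup(self, query, max_distance=2):
        # Tier 1: exact hit, a single set lookup.
        if query in self.exact:
            return query, "exact"
        # Tier 2: compare only against strings sharing the fuzzy key.
        bucket = self.buckets.get(self.key(query), [])
        if bucket:
            best = min(bucket, key=lambda s: levenshtein(query, s))
            if levenshtein(query, best) <= max_distance:
                return best, "bucket"
        # Tier 3: full scan, pruned by the length lower bound.
        best, best_d = None, max_distance + 1
        for s in self.strings:
            if abs(len(s) - len(query)) >= best_d:
                continue  # cannot beat the best distance found so far
            d = levenshtein(query, s)
            if d < best_d:
                best, best_d = s, d
        return (best, "scan") if best is not None else (None, "miss")


idx = TieredIndex(["receive", "necessary", "separate"])
print(idx.lookup("recieve"))   # resolved from the fuzzy-key bucket
print(idx.lookup("xeparate"))  # key differs, so this falls through to the scan
```

The percentages above are estimates, so it is worth logging which tier answers each query to see how often the expensive scan actually runs on your data.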

Also worth checking this discussion: http://stackoverflow.com/questions/309479/how-to-find-best-fuzzy-match-for-a-string-in-a-large-string-database

mikera 2010-08-12 23:41:40
