views:

933

answers:

4

I'm looking for an algorithm that takes two strings and gives me back a "factor of similarity".

Basically, I will have an input that may be misspelled, have letters transposed, etc., and I have to find the closest match(es) in a list of possible values that I have.

This is not for searching in a database. I'll have an in-memory list of 500 or so strings to match against, all under 30 chars, so it can be relatively slow.

I know this exists, I've seen it before, but I can't remember its name.


Edit: Thanks for pointing out Levenshtein and Hamming. Now, which one should I implement? They basically measure different things, both of which can be used for what I want, but I'm not sure which one is more appropriate.

I've read up on the algorithms; Hamming seems obviously faster. Since neither will detect two characters being transposed (i.e. "Jordan" and "Jodran"), which I believe will be a common mistake, which will be more accurate for what I want? Can someone tell me a bit about the trade-offs?

+2  A: 

You're looking for the Levenshtein distance

tehvan
+2  A: 
  • Levenshtein distance
  • Hamming distance
  • Soundex
  • Metaphone
vartec
Hmmm... ok... which one should I use? If I remember correctly, Soundex is not useful because it depends on the first letter being the same; plus, the last time I used it (on a different project), I wasn't very happy with it. What are the trade-offs between Levenshtein and Hamming, for example?
Daniel Magliola
Hamming distance can only be used on strings of the same length. Levenshtein distance is like a generalization of Hamming distance.
tehvan
Well, Hamming distance is more for theoretical purposes. If you want to correct or ignore typos, use Levenshtein; if you want to correct or ignore bad spelling, use Metaphone. Note, however, that Levenshtein works in any language, while Metaphone is English-only.
vartec
As far as Soundex is concerned, I don't think it depends on the first letter being the same, and you can find localized versions of the algorithm (at least for French). I personally had good results with the French version.
Benoit Myard
Not to mention that Soundex can be very handy for large data sets, since you get a Soundex representation of your word that never varies. I had over 80,000 entries in a database and stored their Soundex representations for fast matching against a needle. I never made any real benchmark, though.
Benoit Myard
Actually, it was more around 120,000 different word/Soundex pairs. Soundex matching was blazing fast, since for a given query I only had to compute a Soundex representation for the needle.
Benoit Myard
Benoit, yes, I agree. I've used Soundex for the same reasons (I could precompute it for my records and do a fast SQL query on it). But in this case, I need more accuracy and less speed.
Daniel Magliola
One plus for Soundex is that MS SQL Server has built-in support for it.
Stefan
+2  A: 

The Damerau–Levenshtein distance is similar to the Levenshtein distance, but also covers transposition of two adjacent characters. The Wikipedia page (linked) includes pseudocode that should be fairly trivial to implement.

Autoplectic
+11  A: 

Ok, so the standard algorithms are:

1) Hamming distance. Only defined for strings of the same length, but very efficient: it simply counts the number of positions at which the characters differ. Not useful for fuzzy searching of natural-language text.
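To make the position-by-position counting concrete, here's a minimal Python sketch (not from the original answer; function and variable names are my own):

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(a, b))

# A transposition looks like two substitutions to Hamming distance:
print(hamming("Jordan", "Jodran"))  # -> 2
```

Note how the transposed pair in "Jodran" costs 2, which is why Hamming distance over-penalizes the exact mistake the question cares about.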

2) Levenshtein distance. The Levenshtein distance measures distance in terms of the number of "operations" required to transform one string into the other. These operations are insertion, deletion and substitution. The standard approach to calculating the Levenshtein distance is dynamic programming.
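The dynamic-programming approach can be sketched in a few lines of Python, keeping only a rolling pair of table rows (an illustrative sketch, not a tuned implementation):

```python
def levenshtein(a, b):
    """Edit distance with insertion, deletion and substitution, all cost 1."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
print(levenshtein("Jordan", "Jodran"))   # -> 2 (a transposition costs two edits)
```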

3) Generalized Levenshtein (Damerau–Levenshtein) distance. This distance also takes transpositions of adjacent characters into consideration, and is probably the edit distance best suited to fuzzy matching of manually entered text. The algorithm to compute it is a bit more involved than the plain Levenshtein distance (detecting transpositions is not easy). Common fast implementations are modifications of the bitap algorithm (as used by agrep).
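One simple way to sketch this is the "optimal string alignment" variant, which adds a single transposition case to the Levenshtein DP table (illustrative Python, not the bitap-style implementation mentioned above):

```python
def osa_distance(a, b):
    """Optimal-string-alignment variant of Damerau-Levenshtein distance."""
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Adjacent transposition: "rd" <-> "dr" costs 1, not 2
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("Jordan", "Jodran"))  # -> 1: the transposition is one edit
```

This is exactly the property the question asks for: "Jodran" is now one edit from "Jordan" instead of two.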

In general, you would probably want to consider an implementation of the third option, wrapped in some sort of nearest-neighbour search, e.g. based on a k-d tree.
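That said, for an in-memory list of ~500 short strings where speed is not critical, you may not need a custom implementation at all: Python's standard-library difflib can rank close matches directly. A sketch (the candidate list is a made-up stand-in for your real data; `cutoff` is difflib's similarity threshold between 0 and 1):

```python
import difflib

# Hypothetical stand-in for your in-memory list of ~500 strings
candidates = ["Jordan", "Gordon", "Daniel", "Jourdan"]

# Return up to 3 candidates whose SequenceMatcher ratio is >= 0.6,
# best match first
matches = difflib.get_close_matches("Jodran", candidates, n=3, cutoff=0.6)
print(matches)  # "Jordan" ranks first
```

difflib uses a Ratcliff/Obershelp-style similarity rather than an edit distance, so it handles transpositions reasonably well without any extra work; at this data size a brute-force scan like this is plenty fast.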

Il-Bhima