Finding how similar two strings are

views:

933

answers:

+14 Q:

Finding how similar two strings are

I'm looking for an algorithm that takes 2 strings and will give me back a "factor of similarity".

Basically, I will have an input that may be misspelled, have letters transposed, etc, and I have to find the closest match(es) in a list of possible values that I have.

This is not for searching in a database. I'll have an in-memory list of 500 or so strings to match against, all under 30 chars, so it can be relatively slow.

I know this exists, i've seen it before, but I can't remember its name.

Edit: Thanks for pointing out Levenshtein and Hamming. Now, which one should I implement? They basically measure different things, both of which can be used for what I want, but I'm not sure which one is more appropriate.

I've read up on the algorithms, Hamming seems obviously faster. Since neither will detect two characters being transposed (ie. Jordan and Jodran), which I believe will be a common mistake, which will be more accurate for what I want? Can someone tell me a bit about the trade-offs?

+2 A:

You're looking for the Levenshtein distance

tehvan 2009-02-23 12:23:00

+2 A:

Levenstein distance
Hamming distance
soundex
metaphone

vartec 2009-02-23 12:25:49

Hmmm... ok... which one should I use? If I remember correctly, Soundex is not useful, because it's dependent on the first letter being the same, plus the time I used it (different project), I wasn't very happy about it.What's the tradeoffs between Levenshtein and Hamming, for example?

Daniel Magliola 2009-02-23 12:28:26

Hamming distance can only be used on strings of the same length... Levenshtein distance is like a generalization of Hamming distance

tehvan 2009-02-23 12:34:44

Well, Hamming distance is more for theoretical purposes. If you want to correct or ignore typos -- Levenstein. If you want to correct or ignore bad spelling -- metaphone. Note however, that Levenstein works in any language, metaphone -- English only.

vartec 2009-02-23 12:37:22

As far as Soundex is concerned, I don't think it depends on the first letter being the same and you can find localized versions of the algorithm (at least for French).I personally had good results with the French version.

Benoit Myard 2009-02-23 12:49:35

Not to say that Soundex can be very handy for large data sets since you will get a Soundex representation of your word which never varies.I had over 80000 entries in a database and stored their Soundex representations for fast matching against a needle.I never made any real benchmark though.

Benoit Myard 2009-02-23 12:53:42

Actually it was more around 120000 different words/soundex pairs.Sound matching was bleeding fast since for a given auery I only had to compute a Soundex representation for the needle.

Benoit Myard 2009-02-23 12:56:01

Benoit, yes, I agree. I've used Soundex for the same reasons (I could precompute it for my records, and do a fast SQL query on it).But in this case, I need more accuracy and less speed.

Daniel Magliola 2009-02-23 13:12:41

One plus for Soundex is that MSSQL has built in support for it.

Stefan 2009-02-23 13:14:53

+2 A:

the Damerau-Levenshtein distance is similar to the Levenshtein distance, but also includes two-character transposition. the wikipedia page (linked) includes pseudocode that should be fairly trivial to implement.

Autoplectic 2009-02-23 12:55:57

+11 A:

Ok, so the standard algorithms are:

1) Hamming distance Only good for strings of the same length, but very efficient. Basically it simply counts the number of distinct characters. Not useful for fuzzy searching of natural language text.

2) Levenstein distance. The Levenstein distance measures distance in terms of the number of "operations" required to transform one string to another. These operations include insertion, deletion and substition. The standard approach of calculating the Levenstein distance is to use dynamic programming.

3) Generalized Levenstein/(Damerau–Levenshtein distance) This distance also takes into consideration transpositions of characters in a word, and is probably the edit distance most suited for fuzzy matching of manually-entered text. The algorithm to compute the distance is a bit more involved than the Levenstein distance (detecting transpositions is not easy). Most common implementations are a modification of the bitap algorithm (like grep).

In general you would probably want to consider an implementation of the third option implemented in some sort of nearest neighbour search based on a k-d tree

Il-Bhima 2009-02-23 13:00:51

ansaurus

tags:

views:

answers:

Finding how similar two strings are

related questions