Best practice for determining the probability that 2 strings match

views:

125

answers:

+2 Q:

Best practice for determining the probability that 2 strings match

I need to write code to determine if 2 strings match when one of the strings may contain a small deviation from the second string e.g. "South Africa" v "South-Africa" or "England" v "Enlgand". At the moment, I am considering the following approach

Determine the percentage of characters in string 1 that match those in string 2
Determine the true probability of the match by combining the result of 1 with a comparison of the length of the 2 strings e.g. although all the characters in "SA" are found in "South Africa" it is not a very likely match since "SA" could be found in a range of other country names as well.

I would appreciate to hear what current best practice is for performing such string matching.

Thank you

+10 A:

You can look at Levenshtein distance. This is distance between two strings. The same strings have distance equal 0. Strings such as kitten and sitten have distance equal 1, and so on. Distance is measured by minimal number of simple operations that transform one string to another.

More information and algorithm in pseudo-code is given in link.

I also remember that this topic was mentioned in Game programming gems: volume 6: Article 1.6 Closest-String Matching Algorithm

Dejw 2010-02-22 20:51:54

Great answer, thanks.

Run Loop 2010-02-22 21:01:13

Why is this algorithm "The best practice?"

Paco 2010-02-22 21:01:21

To implement Levenshtein distance algorithm You need only 4 loops, one if statement. I think it will cover less than one screen ;-) In the book I mentioned is more complicated and sophisticated way to match similar strings. I can scan this paper for You, but I have only Polish version ;-)

Dejw 2010-02-22 21:09:59

Or you download a library when it's not in the standard library of your programming language. When you use the 4 loop version, you are probably talking about the O(M*N) implementation. You can optimize to almost O(M)

Paco 2010-02-22 21:38:36

You are certainly right, but examples given by JK are short enough to use slower but easy-to-write version. In case of strings of length greater than 100 optimization is obviously necessary.

Dejw 2010-02-22 21:56:35

The number of strings to inspect is more significant than the string length in the case of optimizing the algorithm

Paco 2010-02-22 22:06:34

+7 A:

To make fuzzy string matching ideal, it's important to know about the context of the strings. When it's just about small typos, Levenstein can be good enough. When it's about misheard sound, you can use a phonetic algorithm like soundex or metaphone. Most times, you need a combination of the following algorithms, and some more specific manually written stuff.

Needleman-Wunsch
Soundex
Metaphone
Levenstein distance
Bitmap
Hamming distance

There is no best fuzzy string matching algorithm. It's all about the context it's used in, so you need to tell us about where you want to use the string matching for.

Paco 2010-02-22 21:00:33

+2 A:

Don't reinvent the wheel. Wikipedia has the Levenshtein algorithm which has metrics for what you want to do.

http://en.wikipedia.org/wiki/Levenshtein_distance

There's also Soundex, but that might be too simplistic for your requirements.

stillstanding 2010-02-22 21:02:43

ansaurus

tags:

views:

answers:

Best practice for determining the probability that 2 strings match

related questions