views:

125

answers:

3

I need to write code to determine if 2 strings match when one of the strings may contain a small deviation from the second string e.g. "South Africa" v "South-Africa" or "England" v "Enlgand". At the moment, I am considering the following approach

  1. Determine the percentage of characters in string 1 that match those in string 2
  2. Determine the true probability of the match by combining the result of 1 with a comparison of the length of the 2 strings e.g. although all the characters in "SA" are found in "South Africa" it is not a very likely match since "SA" could be found in a range of other country names as well.

I would appreciate to hear what current best practice is for performing such string matching.

Thank you

+10  A: 

You can look at Levenshtein distance. This is distance between two strings. The same strings have distance equal 0. Strings such as kitten and sitten have distance equal 1, and so on. Distance is measured by minimal number of simple operations that transform one string to another.

More information and algorithm in pseudo-code is given in link.

I also remember that this topic was mentioned in Game programming gems: volume 6: Article 1.6 Closest-String Matching Algorithm

Dejw
Great answer, thanks.
Run Loop
Why is this algorithm "The best practice?"
Paco
To implement Levenshtein distance algorithm You need only 4 loops, one if statement. I think it will cover less than one screen ;-) In the book I mentioned is more complicated and sophisticated way to match similar strings. I can scan this paper for You, but I have only Polish version ;-)
Dejw
Or you download a library when it's not in the standard library of your programming language. When you use the 4 loop version, you are probably talking about the O(M*N) implementation. You can optimize to almost O(M)
Paco
You are certainly right, but examples given by JK are short enough to use slower but easy-to-write version. In case of strings of length greater than 100 optimization is obviously necessary.
Dejw
The number of strings to inspect is more significant than the string length in the case of optimizing the algorithm
Paco
+7  A: 

To make fuzzy string matching ideal, it's important to know about the context of the strings. When it's just about small typos, Levenstein can be good enough. When it's about misheard sound, you can use a phonetic algorithm like soundex or metaphone. Most times, you need a combination of the following algorithms, and some more specific manually written stuff.

  • Needleman-Wunsch
  • Soundex
  • Metaphone
  • Levenstein distance
  • Bitmap
  • Hamming distance

There is no best fuzzy string matching algorithm. It's all about the context it's used in, so you need to tell us about where you want to use the string matching for.

Paco
+2  A: 

Don't reinvent the wheel. Wikipedia has the Levenshtein algorithm which has metrics for what you want to do.

http://en.wikipedia.org/wiki/Levenshtein_distance

There's also Soundex, but that might be too simplistic for your requirements.

stillstanding