views:

2155

answers:

6

I'm looking for a string similarity algorithm that yields better results on variable length strings than the ones that are usually suggested (levenshtein distance, soundex, etc).

For example,

Given string A: "Robert",

Then string B: "Amy Robertson"

would be a better match than

String C: "Richard"

Also, preferably, this algorithm should be language agnostic (also works in languages other than English).

+12  A: 

Simon White of Catalysoft wrote an article about a very clever algorithm that compares adjacent character pairs that works really well for my purposes:

http://www.catalysoft.com/articles/StrikeAMatch.html

Simon has a Java version of the algorithm and below I wrote a PL/Ruby version of it (taken from the plain ruby version done in the related forum entry comment by Mark Wong-VanHaren) so that I can use it in my PostgreSQL queries:

CREATE FUNCTION string_similarity(str1 varchar, str2 varchar)
RETURNS float8 AS '

str1.downcase! 
pairs1 = (0..str1.length-2).collect {|i| str1[i,2]}.reject {
  |pair| pair.include? " "}
str2.downcase! 
pairs2 = (0..str2.length-2).collect {|i| str2[i,2]}.reject {
  |pair| pair.include? " "}
union = pairs1.size + pairs2.size 
intersection = 0 
pairs1.each do |p1| 
  0.upto(pairs2.size-1) do |i| 
    if p1 == pairs2[i] 
      intersection += 1 
      pairs2.slice!(i) 
      break 
    end 
  end 
end 
(2.0 * intersection) / union

' LANGUAGE 'plruby';

Works like a charm!

marzagao
You found the answer and wrote all that in 4 minutes? Impressive!
Matt J
Interestingly, Simon's approach has a lot in common with approaches such as q-grams and Dice's Coefficient.
Jason Sundram
I prepared my answer after some research and implementation. I put it here to the benefit of whoever else comes looking in SO for a practical answer using an alternative algorithm because most of the answers in related questions seem to revolve around levenshtein or soundex.
marzagao
Just what I have been looking for. Will you marry me?
sharas
very nice! working on an IR system and Soundex algorithm couldn't cut it! Excelent articles by the linked arthor
robinsonc494
A: 

What about Levenshtein distance, divided by the length of the first string (or alternatively divided my min/max/avg length of both strings)? That has worked for me so far.

tehvan
+4  A: 

String Similarity Metrics contains an overview of many different metrics used in string comparison (Wikipedia has an overview as well). Much of these metrics is implemented in a library simmetrics.

Yet another example of metric, not included in the given overview is for example compression distance (attempting to approximate the Kolmogorov's complexity), which can be used for a bit longer texts than the one you presented.

You might also consider looking at a much broader subject of Natural Language Processing. These R packages can get you started quickly (or at least give some ideas).

And one last edit - search the other questions on this subject at SO, there are quite a few related ones.

Anonymous
+3  A: 

marzagao's aswer is great. I converted it to C# so I thought I'd post it here:

/// <summary>
/// This class implements string comparison algorithm
/// based on character pair similarity
/// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
/// </summary>
public class SimilarityTool
{
    /// <summary>
    /// Compares the two strings based on letter pair matches
    /// </summary>
    /// <param name="str1"></param>
    /// <param name="str2"></param>
    /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
    public double CompareStrings(string str1, string str2)
    {
     List<string> pairs1 = WordLetterPairs(str1.ToUpper());
     List<string> pairs2 = WordLetterPairs(str2.ToUpper());

     int intersection = 0;
     int union = pairs1.Count + pairs2.Count;

     for (int i = 0; i < pairs1.Count; i++)
     {
      for (int j = 0; j < pairs2.Count; j++)
      {
       if (pairs1[i] == pairs2[j])
       {
        intersection++;
        pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success

        break;
       }
      }
     }

     return (2.0 * intersection) / union;
    }

    /// <summary>
    /// Gets all letter pairs for each
    /// individual word in the string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private List<string> WordLetterPairs(string str)
    {
     List<string> AllPairs = new List<string>();

     // Tokenize the string and put the tokens/words into an array
     string[] Words = Regex.Split(str, @"\s");

     // For each word
     for (int w = 0; w < Words.Length; w++)
     {
      if (!string.IsNullOrEmpty(Words[w]))
      {
       // Find the pairs of characters
       String[] PairsInWord = LetterPairs(Words[w]);

       for (int p = 0; p < PairsInWord.Length; p++)
       {
        AllPairs.Add(PairsInWord[p]);
       }
      }
     }

     return AllPairs;
    }

    /// <summary>
    /// Generates an array containing every 
    /// two consecutive letters in the input string
    /// </summary>
    /// <param name="str"></param>
    /// <returns></returns>
    private string[] LetterPairs(string str)
    {
     int numPairs = str.Length - 1;

     string[] pairs = new string[numPairs];

     for (int i = 0; i < numPairs; i++)
     {
      pairs[i] = str.Substring(i, 2);
     }

     return pairs;
    }
}
Michael La Voie
A: 

Thanks a lot for such a good Algorithm

Anand
A: 

This discussion has been really helpful, thanks. I converted the algorithm to VBA for use with Excel and wrote a few versions of a worksheet function, one for simple comparison of a pair of strings, the other for comparing one string to a range/array of strings. The strSimLookup version returns either the last best match as a string, array index, or similarity metric.

This implementation produces the same results listed in the Amazon example on Simon White's website with a few minor exceptions on low-scoring matches; not sure where the difference creeps in, could be VBA's Split function, but I haven't investigated as it's working fine for my purposes.

'Implements functions to rate how similar two strings are on
'a scale of 0.0 (completely dissimilar) to 1.0 (exactly similar)
'Source:   http://www.catalysoft.com/articles/StrikeAMatch.html
'Author: Bob Chatham, bob.chatham at gmail.com
'9/12/2010

Option Explicit

Public Function stringSimilarity(str1 As String, str2 As String) As Variant
'Simple version of the algorithm that computes the similiarity metric
'between two strings.
'NOTE: This verision is not efficient to use if you're comparing one string
'with a range of other values as it will needlessly calculate the pairs for the
'first string over an over again; use the array-optimized version for this case.

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection

    Set sPairs1 = New Collection
    Set sPairs2 = New Collection

    WordLetterPairs str1, sPairs1
    WordLetterPairs str2, sPairs2

    stringSimilarity = SimilarityMetric(sPairs1, sPairs2)

    Set sPairs1 = Nothing
    Set sPairs2 = Nothing

End Function

Public Function strSimA(str1 As Variant, rRng As Range) As Variant
'Return an array of string similarity indexes for str1 vs every string in input range rRng
    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim arrOut As Variant
    Dim l As Long, j As Long

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    l = rRng.Count
    ReDim arrOut(1 To l)
    For j = 1 To l
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(j)), sPairs2
        arrOut(j) = SimilarityMetric(sPairs1, sPairs2)
        Set sPairs2 = Nothing
    Next j

    strSimA = Application.Transpose(arrOut)

End Function

Public Function strSimLookup(str1 As Variant, rRng As Range, Optional returnType) As Variant
'Return either the best match or the index of the best match
'depending on returnTYype parameter) between str1 and strings in rRng)
' returnType = 0 or omitted: returns the best matching string
' returnType = 1           : returns the index of the best matching string
' returnType = 2           : returns the similarity metric

    Dim sPairs1 As Collection
    Dim sPairs2 As Collection
    Dim metric, bestMetric As Double
    Dim i, iBest As Long
    Const RETURN_STRING As Integer = 0
    Const RETURN_INDEX As Integer = 1
    Const RETURN_METRIC As Integer = 2

    If IsMissing(returnType) Then returnType = RETURN_STRING

    Set sPairs1 = New Collection

    WordLetterPairs CStr(str1), sPairs1

    bestMetric = -1
    iBest = -1

    For i = 1 To rRng.Count
        Set sPairs2 = New Collection
        WordLetterPairs CStr(rRng(i)), sPairs2
        metric = SimilarityMetric(sPairs1, sPairs2)
        If metric > bestMetric Then
            bestMetric = metric
            iBest = i
        End If
        Set sPairs2 = Nothing
    Next i

    If iBest = -1 Then
        strSimLookup = CVErr(xlErrValue)
        Exit Function
    End If

    Select Case returnType
    Case RETURN_STRING
        strSimLookup = CStr(rRng(iBest))
    Case RETURN_INDEX
        strSimLookup = iBest
    Case Else
        strSimLookup = bestMetric
    End Select

End Function

Public Function strSim(str1 As String, str2 As String) As Variant
    Dim ilen, iLen1, ilen2 As Integer

    iLen1 = Len(str1)
    ilen2 = Len(str2)

    If iLen1 >= ilen2 Then ilen = ilen2 Else ilen = iLen1

    strSim = stringSimilarity(Left(str1, ilen), Left(str2, ilen))

End Function

Sub WordLetterPairs(str As String, pairColl As Collection)
'Tokenize str into words, then add all letter pairs to pairColl

    Dim Words() As String
    Dim word, nPairs, pair As Integer

    Words = Split(str)

    If UBound(Words) < 0 Then
        Set pairColl = Nothing
        Exit Sub
    End If

    For word = 0 To UBound(Words)
        nPairs = Len(Words(word)) - 1
        If nPairs > 0 Then
            For pair = 1 To nPairs
                pairColl.Add Mid(Words(word), pair, 2)
            Next pair
        End If
    Next word

End Sub

Private Function SimilarityMetric(sPairs1 As Collection, sPairs2 As Collection) As Variant
'Helper function to calculate similarity metric given two collections of letter pairs.
'This function is designed to allow the pair collections to be set up separately as needed.
'NOTE: sPairs2 collection will be altered as pairs are removed; copy the collection
'if this is not the desired behavior.
'Also assumes that collections will be deallocated somewhere else

    Dim Intersect As Double
    Dim Union As Double
    Dim i, j As Long

    If sPairs1.Count = 0 Or sPairs2.Count = 0 Then
        SimilarityMetric = CVErr(xlErrNA)
        Exit Function
    End If

    Union = sPairs1.Count + sPairs2.Count
    Intersect = 0

    For i = 1 To sPairs1.Count
        For j = 1 To sPairs2.Count
            If StrComp(sPairs1(i), sPairs2(j)) = 0 Then
                Intersect = Intersect + 1
                sPairs2.Remove j
                Exit For
            End If
        Next j
    Next i

    SimilarityMetric = (2 * Intersect) / Union

End Function
bchatham