views:

383

answers:

6

Hi!

I'm looking for a way to compare a string with an array of strings. Doing an exact search is quite easy of course, but I want my program to tolerate spelling mistakes, missing parts of the string and so on.

Is there some kind of framework which can perform such a search? I'm having something in mind that the search algorithm will return a few results order by the percentage of match or something like this.

Best Regards,
Oliver Hanappi

+1  A: 

Here are two methods that calculate the Levenshtein Distance between strings.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character.

Once you have the result, you'll need to define what value you want to use as a threshold for a match or not. Run the function on a bunch of sample data to get a good idea of how it works to help decide on your particular threshold.

    /// <summary>
    /// Calculates the Levenshtein distance between two strings--the number of changes that need to be made for the first string to become the second.
    /// </summary>
    /// <param name="first">The first string, used as a source.</param>
    /// <param name="second">The second string, used as a target.</param>
    /// <returns>The number of changes that need to be made to convert the first string to the second.</returns>
    /// <remarks>
    /// From http://www.merriampark.com/ldcsharp.htm
    /// </remarks>
    public static int LevenshteinDistance(string first, string second)
    {
        if (first == null)
        {
            throw new ArgumentNullException("first");
        }
        if (second == null)
        {
            throw new ArgumentNullException("second");
        }

        int n = first.Length;
        int m = second.Length;
        var d = new int[n + 1, m + 1]; // matrix

        if (n == 0) return m;
        if (m == 0) return n;

        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        for (int i = 1; i <= n; i++)
        {

            for (int j = 1; j <= m; j++)
            {
                int cost = (second.Substring(j - 1, 1) == first.Substring(i - 1, 1) ? 0 : 1); // cost
                d[i, j] = Math.Min(
                    Math.Min(
                        d[i - 1, j] + 1,
                        d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }

        return d[n, m];
    }
Sam
+1  A: 

How about starting at Wikipedia's Approximate String Matching article?

Thorsten79
+6  A: 

You could use the Levenshtein Distance algorithm.

"The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character." - Wikipedia.com

This one is from dotnetperls.com:

using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}

You may in fact prefer to use the Damerau-Levenshtein distance algorithm, which also allows characters to be transposed, which is a common human error in data entry. You'll find a C# implementation of it here.

RedFilter
I swear, Levenshtein Distance comes up on SO at least three times a week.
Adam Robinson
@Adam - 'cause it's *that awesome* :)
RedFilter
@Adam, great guess. It's been 9 times so far this month already, which is almost exactly 3 / week.
Sam
I'd have to argue against Levenshtein Distance in this case. While it's great for figuring out how different two strings are, spelling mistakes more often than not retain the proper phonetics. For example, the LD algorithm will probably *not* indicate that "cool cat" and "kool kat" are similar (which is what I believe the poster wishes) whereas Soundex and Metaphone are more likely to indicate the similarity between those words/phrases.
casperOne
@casperOne: hard to say without knowing the data set it is being applied to, but agree there is no one-size-fits-all approach. I'm a big fan of double metaphone.
RedFilter
A: 

There is a similar question here.

I also wrote a blog post on string similarity here, which includes full source to a set of extension methods on string to get Levenshtein Distance, SoundEx, Double Metaphones and ask if two strings are similar.

plinth
+4  A: 

There is nothing in the .NET framework that will help you with this out-of-the-box.

The most common spelling mistakes are those where the letters are a decent phonetic representation of the word, but not the correct spelling of the word.

For example, it could be argued that the words sword and sord (yes, that's a word) have the same phonetic roots (they sound the same when you pronounce them).

That being said, there are a number of algorithms that you can use to translate words (even mispelled ones) into phonetic variants.

The first is Soundex. It's fairly simple to implement and there are a fair number of .NET implementations of this algorithm. It's rather simple, but it gives you real values you can compare to each other.

Another is Metaphone. While I can't find a native .NET implementation of Metaphone, the link provided has links to a number of other implementations which could be converted. The easiest to convert would probably be the Java implementation of the Metaphone algorithm.

It should be noted that the Metaphone algorithm has gone through revisions. There is Double Metaphone (which has a .NET implementation) and Metaphone 3. Metaphone 3 is a commercial application, but has a 98% accuracy rate compared to an 89% accuracy rate for the Double Metaphone algorithm when run against a database of common English words. Depending on your need, you might want to look for (in the case of Double Metaphone) or purchase (in the case of Metaphone 3) the source for the algorithm and convert or access it through the P/Invoke layer (there are C++ implementations abound).

Metaphone and Soundex differ in the sense that Soundex produces fixed length numeric keys, whereas Metaphone produces keys of different length, so the results will be different. In the end, both will do the same kind of comparison for you, you just have to find out which suits your needs the best, given your requirements and resources (and intolerance levels for the spelling mistakes, of course).

casperOne
You'll find Double Metaphone implemented in C# here (download the source file): http://www.codeproject.com/KB/string/dmetaphone6.aspx
RedFilter
@OrbMan: I've updated the answer, thanks for the link.
casperOne
+1  A: 

You can find implementations of soundex and the levenshtein distance algorithms in the open source CommonLibrary.NET project.

Elijah