views:

995

answers:

10

Given two strings text1 and text2

public SOMEUSABLERETURNTYPE Compare(string text1, string text2)
{
     // DO SOMETHING HERE TO COMPARE
}

Examples:

  1. First String: StackOverflow

    Second String: StaqOverflow

    Return: Similarity is 91%

    The return can be in % or something like that.

  2. First String: The simple text test

    Second String: The complex text test

    Return: The values can be considered equal

Any ideas? What is the best way to do this?

+10  A: 

Levenshtein distance is probably what you're looking for.

LiraNuna
+1  A: 

You might look for string "distances", for example the Levenshtein distance.

schnaader
+3  A: 

See Levenshtein Distance in C#.

Rony
+9  A: 

There are various different ways of doing this. Have a look at the Wikipedia "String similarity measures" page for links to other pages with algorithms.

I don't think any of those algorithms take sounds into consideration, however - so "staq overflow" would be as similar to "stack overflow" as "staw overflow" despite the first being more similar in terms of pronunciation.

I've just found another page which gives rather more options... in particular, the Soundex algorithm (Wikipedia) may be closer to what you're after.

Jon Skeet
A: 

Jeff Atwood wrote about looking for a similar solution for determining the authorship of wiki posts which may help you narrow your search.

John Sheehan
+3  A: 

To deal with 'sound alikes' you may want to look into encoding using a phonetic algorithm like Double Metaphone or soundex. I don't know if computing Levenshtein distances on phonetic encoded strings would be beneficial or not, but might be a possibility. Alternately, you could use a heuristic like: convert each word in the string to its encoded form and remove any words that occur in both strings and replace them with a single representation before computing the Levenshtein distance.

bdk
The soundex algorithm is used by the medical community to warn about like sounding surnames. It is included in many standard libraries.
slypete
Metaphone is far superior to Soundex.
Dour High Arch
+2  A: 

Perl module Text::Phonetic has implementations of various algorithms.

Sinan Ünür
+1  A: 

If you're comparing values in a SQL database you can use the SOUNDEX function. If you query Google for SOUNDEX and C#, some people have written a similar function for that and VB.

Rob
A: 

I have to recommend Soundex too, I have used it in the past to process misspelt city names. Here is a good link for usage: http://whitepapers.zdnet.com/abstract.aspx?docid=352953

A: 

I wrote a Double Metaphone implementation in C# a while back. You'll find it vastly superior to Soundex and the like.

Levenshtein distance has also been suggested, and it's a great algorithm for a lot of uses, but phonetic matching is not really what it does; it only seems that way sometimes because phonetically similar words are also usually spelled similarly. I did an analysis of various fuzzy matching algorithms which you might also find useful.

anelson