views:

961

answers:

6

Hi,

do you know any good algorithms that match two strings and then return a percentage in how many percent those two strings match?

And are there some, that work with databases too?

A: 

Maybe this one will be useful for ya:

http://stackoverflow.com/questions/53480/fuzzy-text-sentencestitles-matching-in-c

Lukas Šalkauskas
+6  A: 

The Levenstein distance is such a measure. It basically tells you how many characters need to be edited, deleted or added, to get from the first to the second string. I'm not sure whether some database systems support that.

But I know for sure that a much more simplified algorithm named Soundex is supported in some database systems.

Bananeweizen
+1  A: 

I think the problem you're looking for is called Edit Distance. It is expensive to compute in general, but if you are looking for strings within small edit distance of other strings, it is not so bad. There is more information in the Wikipedia article.

Norman Ramsey
A: 

Would this be of help? I just ran into it. http://stackoverflow.com/questions/188476/comparing-two-strings-producing-a-numeric-delta

Skunk
+2  A: 

It depends upon your criteria for similarity. Other people have already referred you to Levenstein distance (edit distance is the same thing). That's usually pretty good, and definitely more language-independent than something like soundex. However, be aware that Levenstein difference does not handle transposition very well. Thus:

Levenstein("copy", "cpoy") == 2

If you're trying to deal with human input, transpositions are fairly common. Whether that's a problem or not depends on your metrics for similarity.

It's been a while, but I believe Postgresql has levenstein() either built-in or available as a contrib C module.

Tom
A: 

How to best match two strings? Have them go out for coffee, and if they hit it off, dinner and a movie. Or maybe they could do some peer programming? It depends on the strings, really. Even coffee can often be tricky.

Yar