views:

52

answers:

2

Hi,

I am looking for a simple way (UDF?) to establish the similarity between strings. The SOUNDEX and DIFFERENCE function do not seem to do the job.

Similarity should be based on number of characters in common (order matters).

For example:

Spiruroidea sp. AM-2008

and

Spiruroidea gen. sp. AM-2008

should be recognised as similar.

Any pointers would be very much appreciated.

Thanks.

Christian

+5  A: 

You may want to consider implementing the Levenshtein Distance algorithm as a UDF, so that it will return the number of operations that need to be performed on String A in order for it to become String B. This is often referred to as the edit distance.

You can then compare the result of the Levenshtein Distance function against a fixed threshold, or against a percentage length of String A or String B.

You would simply use it as follows:

WHERE LEVENSHTEIN(Field_A, Field_B) < 4;

You may want to check out the following Levenshtein Distance implementation for SQL Server:

Daniel Vassallo
Even though it seems complicated....it is very cool. This would give more accurate results than SoundEx. +1 for you.
Raja
+1  A: 

These kind of things are not trivial and you should provide more examples.

As mentioned by Daniel levenshtein distance is a way to go, but also for your example you might want to pre-process the strings if you know that you can safely drop certain words - for example it seems from your example that the word gen. can be dropped.

The levenshtein distance will consider any four letter word instead of gen. as the same as gen. which might not be what you want.

Also if your data set will come from different data sources you might consider building a dictionary of synonyms and investigate existing standard taxonomies for your domain. Perhaps such as this?

Unreason