views:

152

answers:

3

So I've got a column in a table that contains a string values (keywords populated from a 3rd party tool). I'm working on an automated tool to identify clusters of similar values that could probably be normalized to a single value. For example, "Firemen"/"Fireman", "Isotope"/"Asotope" or "Canine"/"Canines".

An approach that calculates the levenshtein distance seems ideal except for the fact that it involves too much string manipulation/comparison and would probably make poor use of SQL indexes.

I've considered incrementally grouping by the Left(X) characters of the column, which is a not-so-bad way to maximize index use, but this approach is really only effective at finding words with differences at the very end of the word.

Anyone got some good ideas for solving this problem efficiently in SQL?

Note: I realize this question is very similar to (http://stackoverflow.com/questions/577463/finding-how-similar-two-strings-are), but the distinction here is the need to do this efficiently in SQL.

+2  A: 

You don't mention what DB your using, but if it's T-SQL, you could use the SOUNDEX value and difference.

JP Alioto
Presently I am using T-SQL, but I didn't mention it in hopes of getting a more generalized answers that might work in multiple database platforms. The Soundex/Difference approach looks promising though. I'll give it a spin.
JohnFx
+1  A: 

If you are using SQL Server, you might look into using the SOUNDEX() function as in:

...
where
   SOUNDEX("searchterm") = SOUNDEX(searchvaluefield)

it is supposed to do Phonetic matching on the strings ...

Ron

Some odd examples ... so it seems you could catch plurals by always appending the plural text to both sides, since multiple 's's sound the same ... :-)

select soundex('Canine'), soundex('Canines')
go

----- ----- 
C550  C552  

1 Row(s) affected


select soundex('Canine'), soundex('Caynyn')
go

----- ----- 
C550  C550  

1 Row(s) affected


select soundex('Canines'), soundex('Caniness')
go

----- ----- 
C552  C552  

1 Row(s) affected
Ron Savage
Any experience on how well Soundex resolves plural versions of words? Sounds like it would do well for "firemen"/"Fireman" but perhaps not so much for "Canine"/"Canines".
JohnFx
A: 

John, if you are using MS SQL Server, you can take advantage of the Full-Text Indexing service. Full-text search functionality has some powerful functions using which you can achieve this.

Kirtan