I've got a column in a table that contains string values (keywords populated by a 3rd-party tool). I'm working on an automated tool to identify clusters of similar values that could probably be normalized to a single value. For example, "Firemen"/"Fireman", "Isotope"/"Asotope" or "Canine"/"Canines".
An approach based on Levenshtein distance seems ideal, except that it has to compare every value against every other value, each comparison involves a lot of string manipulation, and it would probably make poor use of SQL indexes.
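To make the cost concrete, here is a minimal sketch of the distance I mean. It is written in Python purely for illustration (it is not the SQL I'm after), and the function name `levenshtein` is my own. Each call does work proportional to the product of the two string lengths, and clustering naively would need one call per pair of keywords.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using two rolling rows."""
    # Keep the shorter string as the inner dimension to minimize memory.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for left, right in [("Firemen", "Fireman"),
                        ("Isotope", "Asotope"),
                        ("Canine", "Canines")]:
        print(left, right, levenshtein(left, right))
```

All three example pairs come out at a distance of 1, which is why a small threshold on this metric looks attractive for clustering.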
I've considered incrementally grouping by the first X characters of the column, i.e. Left(X), for decreasing values of X. That is a reasonable way to maximize index use, but it is really only effective at finding words that differ at the very end; a difference near the start, as in "Isotope"/"Asotope", puts the two values in different groups.
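A rough sketch of that prefix grouping, run against an in-memory SQLite database from Python so it is self-contained. The table name `keywords`, the column name `keyword`, and the helper `prefix_clusters` are all made up for the example, and SQLite's `substr` stands in for `LEFT`; my real database and schema differ.

```python
import sqlite3


def prefix_clusters(words, prefix_len):
    """Group keywords sharing their first prefix_len characters.

    Returns {prefix: sorted list of keywords} for groups with 2+ members.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE keywords (keyword TEXT)")  # hypothetical schema
    conn.executemany("INSERT INTO keywords VALUES (?)",
                     [(w,) for w in words])
    rows = conn.execute(
        """
        SELECT prefix, group_concat(keyword, '|')
        FROM (SELECT substr(keyword, 1, ?) AS prefix, keyword
              FROM keywords)
        GROUP BY prefix
        HAVING COUNT(*) > 1
        """,
        (prefix_len,),
    ).fetchall()
    conn.close()
    # Sort members in Python; group_concat order is not guaranteed.
    return {prefix: sorted(members.split("|")) for prefix, members in rows}


if __name__ == "__main__":
    sample = ["Firemen", "Fireman", "Isotope", "Asotope", "Canine", "Canines"]
    print(prefix_clusters(sample, 5))
```

With a prefix length of 5 this pairs "Fireman"/"Firemen" and "Canine"/"Canines", but "Isotope" and "Asotope" never land in the same group at any prefix length, which is exactly the limitation described above.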
Does anyone have good ideas for solving this problem efficiently in SQL?
Note: I realize this question is very similar to http://stackoverflow.com/questions/577463/finding-how-similar-two-strings-are, but the distinction here is the need to do this efficiently in SQL.