views:

221

answers:

7

I am trying to write a query to find all USERS in a USERS table that are similar. Its like finding all non distinct rows but using a LIKE statement not an equals statement. Below is an example column of a USERS table

 USERNAME 
------------
tim.smith
doug.funny
tim.smith1
dan.snyder
tim.smith20
doug.funny2
emily.hunt

after query the output should look like this.

tim.smith
tim.smith1
tim.smith20
doug.funny
doug.funny2
A: 

If your data is exactly as you describe, you could just trim the numbers off the end before ordering them, but I suspect your actual data is more complicated than that.

You might want to look at the sql server SOUNDEX and DIFFERENCE functions. It may not be exactly what you need, but it would probably get you close

Eric Petroelje
A: 

Couldn't you just do

select * from USERTABLE order by USERNAME;
Jared
A: 

It would be helpful if you define exactly what you mean by "similar". Do the entries always follow a certain pattern, such as "letters, period, letters, optional numbers"?

If you're looking for typos or possible misspellings, you can try fuzzy string matching algorithms, such as soundex or Levenshtein edit distance.

weiyin
A: 

Fancy UDF based on the Levenshtein distance:

CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
  DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
    @cv0 varbinary(8000), @cv1 varbinary(8000)
  SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
  WHILE @j <= @s2_len
    SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
  WHILE @i <= @s1_len
  BEGIN
    SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
    WHILE @j <= @s2_len
    BEGIN
      SET @c = @c + 1
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
        CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
      IF @c > @c_temp SET @c = @c_temp
      SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
      IF @c > @c_temp SET @c = @c_temp
      SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
    END
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN @c
END

(Cribbed from SQLTeam.com forums.)

Unfortunately, this only works when provided two strings, and in O(nm) time, where n and m are the length of the two strings. However, you could improve upon this by modifying the function to cut out as a "NO MATCH" when @c is greater than, say, 3.

But then you'd still need a cursor to walk through each user name in the table and compare it to every other user. Not very efficient, that's for sure.

kthejoker
Rather than a cursor, you could use a cartesian join (one of the few instances where you would actually want to do that). Of course the performance would still be awful.
Eric Petroelje
A: 

Or you could write a function computing the levenshtein distance (http://de.wikipedia.org/wiki/Levenshtein-Distanz) and then join the table using an expression like

levenshtein(a.username, b.username) <=2

you might want to remove any special characters from the names before that, if you considere such characters 'garbage'

Jens Schauder
+1  A: 

I have been looking for a legitimate reason to use a cross join

Select Distinct u1.UserID 
from username  u1 Cross join username u2
where u1.UserID <> u2.UserID
and 
(PatIndex('%' + u1.UserID  + '%', u2.UserID)  <> 0
 OR
PatIndex('%' + u2.UserID  + '%', u1.UserID)  <> 0)
order by u1.UserID

doug.funny
doug.funny2
tim.smith
tim.smith1
tim.smith20
cmsjr
Awesome this worked great! Thanks!
Glad I could help.
cmsjr
Note: This only works if at least ONE occurance of the name does not have a number after it... If there were Dan.Snyder1 and Dan.Snynder2 but no Dan.Snyder, this would not work...
Dems
also, if this helps and is your correct answer, it's rude not to mark it so and give cmsjr his due kudos :)
Dems
That's a good point, a supporting function to strip out numerals would do the trick. Let me know if that would help, or if the above was adequate for your purposes.
cmsjr
A: 

SSIS has functions in it to handle this for you. Run the table through SSIS and it will spid out the tables and the matches along with how close they match.

mrdenny