ansaurus

Question

T-SQL Soundex/Difference to find duplicate rows

Answer 1

+1 A:

If you just want to get likely duplicates, the checksum/binary_checksum functions would give you a good indication, though it's just a 32bit hash so depending on your dataset size you may end up with a few false-positives. checksum() is case-insensitive, binary_checksum() is case-sensitive. This will give you a 32bit hash for every record in your table:

select   checksum(*), binary_checksum(*)
from     tableName;

You could do a self join matching on duplicate hashes for records with different ID values (or different name values, etc. depending on what makes a given record unique in your dataset). Would look something like this:

select   id, checksum(*)
from     tableName a
join     tableName b
on       a.checksum(*) = b.checksum(*)
and      a.id <> b.id;

These 2 functions can take any list of columns for an argument and provide a hash, so if you just want to hash the fName, lName, address, etc. columns rather than the whole record, your checksum function would look like this:

checksum(a.fName, a.lName, a.address, ...)

rather than checksum(*) like in the examples above.

chadhoc 2009-12-09 17:03:48

Nice for exactlmathces, not so useful to find near misses.

HLGEM 2009-12-09 17:35:22

ahhh - I think I misunderstood the OPs intent...I originally assumed he was looking for duplicates that he could purge (i.e. duplicates with same overall values just in different locations/columns). After re-reading and seeing your answer above, sounds like the OP is looking for fuzzy matches, which is a totally different story from my answer altogether...

chadhoc 2009-12-09 18:38:38

Answer 2

+3 A:

If you are using SQl server 2005 or above, you can use fuzzy matching in SSIS to do this task. I found that I got significantly better results in doing this than in looking for soundex matches or writng my own sql scode to look for near matches.

HLGEM 2009-12-09 17:34:24

Fuzzy Matches ARE fantastic. Just be aware that they require SQL Server Enterprise Edition. Here's an article showing how to use them: http://www.sqlteam.com/article/using-fuzzy-lookup-transformations-in-sql-server-integration-services

Michael K Campbell 2009-12-18 16:15:05

ansaurus

tags:

views:

answers:

T-SQL Soundex/Difference to find duplicate rows

related questions