ansaurus

Question

Finding approximately duplicate database records using T-SQL?

Answer 1

+1 A:

Full-Text Search (FTS) is your best bet here. Running Levenshtein distance over any non-trivially sized corpus of text quickly becomes problematic because of the computation required. Levenshtein distance, SOUNDEX and similar techniques are more commonly used for character-level discrepancies than for word-level ones. Assuming the words are at least spelled correctly, FTS would be a better fit. I can also imagine a two-tiered approach: use FTS to identify likely match candidates, then perform finer-grained matching over the filtered set.

If you really want to go to town, one of the best-performing structures for searching text is the trie, but it is tricky to implement in tables and works better as an in-memory data structure. A word-based n-gram solution might also be worth investigating.
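To make the two-tiered idea and the word-based n-gram idea concrete, here is a minimal in-memory sketch in Python rather than T-SQL. It is only an illustration under some assumptions: the function names (`word_ngrams`, `jaccard`, `near_duplicates`) are hypothetical, words are split on whitespace, the first tier is a plain inverted index standing in for FTS, and the 0.5 threshold is arbitrary.

```python
from collections import defaultdict
from itertools import combinations


def word_ngrams(text, n=2):
    """Return the set of word n-grams (as tuples) in the lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        # Too short for a full n-gram: treat the whole text as one gram.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(records, threshold=0.5, n=2):
    """Return (id_a, id_b, score) for record pairs scoring >= threshold.

    records: dict mapping a sortable record id to its text.
    """
    grams = {rid: word_ngrams(text, n) for rid, text in records.items()}

    # Tier 1: inverted index from n-gram to record ids, so that only
    # records sharing at least one n-gram are ever compared.
    index = defaultdict(set)
    for rid, gram_set in grams.items():
        for gram in gram_set:
            index[gram].add(rid)

    candidates = set()
    for rids in index.values():
        candidates.update(combinations(sorted(rids), 2))

    # Tier 2: finer-grained scoring over the candidate pairs only.
    results = []
    for a, b in sorted(candidates):
        score = jaccard(grams[a], grams[b])
        if score >= threshold:
            results.append((a, b, score))
    return results
```

You would call it as `near_duplicates({1: "first record text", 2: "second record text"})` after pulling the rows out of the table. The point of the first tier is to avoid scoring every pair of rows, which is the same reason for putting FTS in front of a more expensive comparison.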

spender 2009-12-31 03:02:56

Answer 2

A:

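You might want to investigate the two T-SQL functions SOUNDEX() and DIFFERENCE(). SOUNDEX() returns a four-character phonetic code for a string, and DIFFERENCE() compares the SOUNDEX codes of two strings and returns an integer from 0 to 4, where 4 means the codes are most similar. These might be of some use to you.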

Charles Bretana 2009-12-31 03:06:22

Answer 3

+4 A:

If you only need to bulk-load the table, or to remove duplicates periodically, you could also use the Fuzzy Grouping transformation in SSIS. Here is a result for your example.

[Image: Fuzzy Grouping output for the example data]

Results are grouped by _key_out; the "original" row of each group is the one where _key_in = _key_out. If _key_out <> _key_in, the row is similar to a previous one. You can set the minimum similarity, delimiters, case sensitivity, etc.

Damir Sudarevic 2009-12-31 11:23:09
