views:

215

answers:

3

Hey all. I have a MSSQL 2008 database with a fair number of rows. As of now, before new rows are inserted into the table, the stored procedure checks to see if that record already exists in the database (by checking a column labeled Title). This check is exact, and if the to-be-inserted record is slightly different, it will insert it instead of updating the existing row (which is an approximate match). What I would like to do is somehow detect approximate duplications in the table before inserting. So a new record that is to be inserted:

The quick brown fox jumps over the lazy dog

would approximately match:

Quick brown fox jumps over the lazy dog

if this record exists in the table already. I've seen (and used for other situations) the Levenshtein Distance algorithm implemented in T-SQL, but I'm not sure if this could be applied in my case because a pair of input strings are required to execute the algorithm. How are members of the community handing things of this sort? Thanks.

+1  A: 

Full-Text Search is your best bet here. Using Levenshtein on any non-trivial sized corpus of text soon becomes problematic due to the computational grunt required. It's more common to use LD/SOUNDEX etc for character based discrepancies rather than word based discrepancies. Assuming words are at minimum correctly spelled, FTS would be a better fit. I can also imagine a two-tiered approach using FTS to identify likely match candidates, with finer grained matching performed over the filtered set. If you really want to go to town, then one of the best performing structures for searching text is the Trie, but this is tricky to implement in tables, and works better as an in-memory data-structure. A word based n-gram solution might also be worth investigating.

spender
A: 

You might want to investigate the two T-SQL functions SoundEx() and Difference(). These might be of some use to you.

Charles Bretana
+4  A: 

If you only have to (bulk) load the table, or periodically remove duplicates, you could also use Fuzzy Grouping Transformation in SSIS -- here is a result for your example.

alt text

Results are grouped by _key_out, the "original" row is identified by _key_in = _key_out. If _key_out <> _key_in the row is similar to a previous one -- you can set minimum similarity, delimiters, case sensitivity etc..

Damir Sudarevic