views:

318

answers:

4

I want to find possible candidate duplicate records in a large database matching on fields like COMPANYNAME and ADDRESSLINE1

Example:

For a record with the following COMPANYNAME:

  • "Acme, Inc."

I would like for my query to spit out other records with these COMPANYNAME values as possible dups:

  • "Acme Corporation"
  • "Acme, Incorporated"
  • "Acme"

I know how to do the joins, correlated subqueries, etc. to do the mechanics of pulling the set of data I want. And I know that has been covered on here before. I am interested hearing thoughts on the best way to do the fuzzy searching - should I use full-text indexing or the soundex function or something else that I am unware of for this process? (I am using SQL Server 2005)

Any help is appreciated!

+1  A: 

I would recommend using an SSIS task to periodically clean up the data. SSIS has fuzzy matching operators, and there are third party providers that offer more powerfull components. Some articles on the topic:

If the budget permits and the size of operation is worth it, you can even consider an MDS server: SQL Server 2008 R2 Master Data Services.

Remus Rusanu
I've used this and it works really well for finding possible duplicates
HLGEM
+1  A: 

It will of course depend on your exact requirements, but using CONTAINS in your SQL gives you the ability to carry out proximity searches, as well as thematic and fuzzy searches.

http://www.developer.com/db/article.php/3446891/Understanding-SQL-Server-Full-Text-Indexing.htm

http://msdn.microsoft.com/en-us/library/ms187787(SQL.90).aspx

davek
A: 

Take a look at Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL http://www.sqlservercentral.com/articles/Fuzzy+Match/65702/

Ira Warren Whiteside
A: 

Also a new SSIS Data Quality Toolkit is available at http://www.melissadata.com/dqt/total-data-quality-integration.htm

Ira Warren Whiteside