I have a table Persons with personal data and so on. There are lots of columns, but the ones of interest here are addressindex, lastname and firstname, where addressindex is a unique address drilled down to the door of the apartment. So if, as in the example below, two persons have the same addressindex and lastname and one of the firstnames is the same, they are most likely duplicates.

I need a way to list these duplicates.

tabledata:

personid     1
firstname    "Carl"
lastname     "Anderson"
addressindex 1

personid     2
firstname    "Carl Peter"
lastname     "Anderson"
addressindex 1

I know how to do this if I were to match exactly on all columns, but I need a fuzzy match to do the trick, with (from the above example) a result like:

Row     personid      addressindex     lastname     firstname
1       2             1                Anderson     Carl Peter
2       1             1                Anderson     Carl
.....

Any hints on how to solve this in a good way?

+2  A: 

I would use SQL Server Full Text Indexing, which will allow you to do searches that return rows that not only contain the word but may also contain a misspelling of it.

Russ Bradberry
here is a nice article on it: http://www.developer.com/db/article.php/3446891
Russ Bradberry
Thanks, I have considered it, but we use Standard Edition and full text search isn't an option here.
Frederik
Full Text Search is available in all editions of SQL Server 2005 and 2008
Russ Bradberry
ok, then I will consider it.
Frederik
A: 

You can use the SOUNDEX and related DIFFERENCE function in SQL Server to find similar names. The reference on MSDN is here.
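
To illustrate what a Soundex comparison does, here is a minimal sketch of the classic American Soundex rules in Python. It is not SQL Server's implementation and may differ from the built-in SOUNDEX in edge cases. The `difference` helper is a simplified, hypothetical stand-in for DIFFERENCE that just counts matching positions in the two codes.

```python
# Classic American Soundex, as an illustration only (not SQL Server's code).
CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character Soundex code, or '' if name has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def difference(a, b):
    """Simplified stand-in for DIFFERENCE: matching positions, 0 to 4."""
    return sum(1 for x, y in zip(soundex(a), soundex(b)) if x == y)


print(soundex("Anderson"), soundex("Andersen"))  # A536 A536
print(soundex("Carl"), soundex("Karl"))          # C640 K640
print(difference("Carl", "Karl"))                # 3
```

Note that Soundex keeps the first letter as is, so Carl and Karl get different codes and only a similarity score brings them together.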

Matt Spradley
+6  A: 

You might find this helpful:

http://anastasiosyal.com/archive/2009/01/11/18.aspx

It provides an introduction to SOUNDEX and also gives step by step instructions for setting up an open source plugin that's reported to work a little better.

Joel Coehoorn
+1 for the nice link!
RedFilter
SOUNDEX is, I believe, standard and doesn't require any special database setup, but it isn't all that accurate (IMHO). Full Text Indexing gives you more features but takes more work to set up.
Zack
Thanks, this solves my problem better than my own attempt. Thanks, Chris, for your thorough explanation of the library and how to expose it as functions in SQL Server. SimMetrics is a great library.
Frederik
+2  A: 

In addition to the other good info here, you might want to consider using the Double Metaphone phonetic algorithm which is much superior to SOUNDEX. There is a Transact-SQL version.

That will assist in matching names with slight misspellings, e.g., Carl vs. Karl.

http://www.sqlmag.com/Articles/ArticleID/26094/pg/1/1.html

RedFilter
A: 

Thanks, all, for the good advice. In the end you need to weigh the pros and cons to make a good judgement on which path to take. I chose to write a split function to split the names up and then do an exact match. I will add an example shortly that I would appreciate comments on.
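
Until that example is posted, the split-and-match idea can be sketched roughly as follows. This is an illustration in Python rather than T-SQL and not the actual split function. The `find_likely_duplicates` name, the case-insensitive comparison and the third sample row are assumptions made for the demo.

```python
from collections import defaultdict
from itertools import combinations


def find_likely_duplicates(persons):
    """Pair up persons with the same address and lastname whose
    firstnames share at least one whitespace-separated token."""
    groups = defaultdict(list)
    for p in persons:
        key = (p["addressindex"], p["lastname"].strip().lower())
        groups[key].append(p)

    pairs = []
    for members in groups.values():
        for a, b in combinations(members, 2):
            tokens_a = set(a["firstname"].lower().split())
            tokens_b = set(b["firstname"].lower().split())
            if tokens_a & tokens_b:  # exact match on any name part
                pairs.append((a["personid"], b["personid"]))
    return pairs


persons = [
    {"personid": 1, "firstname": "Carl", "lastname": "Anderson", "addressindex": 1},
    {"personid": 2, "firstname": "Carl Peter", "lastname": "Anderson", "addressindex": 1},
    # Hypothetical extra row: shares "Peter" but lives at another address.
    {"personid": 3, "firstname": "Peter", "lastname": "Anderson", "addressindex": 2},
]

print(find_likely_duplicates(persons))  # [(1, 2)]
```

Grouping by addressindex and lastname first keeps the pairwise comparison limited to people behind the same door, which is where the duplicates are expected.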

Frederik