views:

746

answers:

3

I have a project where a significant piece of the pie will be to identify where a record is duplicated in the database (Sql Server 2005). I know the obvious ways to find a duplicate record. However, in this case, we want to be fairly smart about the process. The table(s) will contain information about a prospective customer(lead). The initial tables will accept all leads. We will then go through a dupe process which will check if the lead is a duplicate by matching on several fields. For example, we may want to match on the last name, first name, email and zip code. This is just an example, but essentially we want to create a key using various fields to know whether this is person exists. The records that are not dupes will go into a final table.

I would like to use SSIS for this, but am not sure the best way to use SSIS to accomplish this. Can someone steer me in the right direction or provide a link to an example that uses SSIS to deal with dupes by checking a combination of fields?

+1  A: 
-- List all Duplicates
select m1.lastname, m1.firstname, m1.email, m1.zipcode
from tblMain m1
inner join tblMain m2
on isnull(m1.lastname, '') = isnull(m2.lastname, '')
and isnull(m1.firstname, '') = isnull(m2.firstname, '')
and isnull(m1.email, '') = isnull(m2.email, '')
and isnull(m1.zipcode, '') = isnull(m2.zipcode)
and m1.ID <> m2.ID
order by 1, 2, 3, 4

To Delete the latest duplicates, use something like:

delete from tblMain
where ID in 
(
    select m1.ID
    from tblMain m1
    inner join tblMain m2
    on isnull(m1.lastname, '') = isnull(m2.lastname, '')
    and isnull(m1.firstname, '') = isnull(m2.firstname, '')
    and isnull(m1.email, '') = isnull(m2.email, '')
    and isnull(m1.zipcode, '') = isnull(m2.zipcode)
    and m1.ID > m2.ID
)
Gordon Bell
A: 

I don't understand how can you be sure SSIS is the answer to your problem. Why can't you simply create unique keys in your "final" table to ensure you are not adding duplicates? Perhaps you should explain your problem better...

Sam
+2  A: 

It seems to me you are trying to outwit your users by trying to second guess them. Unfortunately, this almost always never works, as you may indeed have two clients with the same name, but different zipcodes, or other examples like that.

Your best bet is to "suggest" that the customer they are about to save already exists (and show them the duplicate) but allow them to save anyway. So the process needs to probably look like this:

  1. User enters info in and presses Save
  2. System detects a potential duplicate using criteria, and prompts user
  3. User will either cancel or confirm, upon which you take the appropriate action

If there is no potential duplicate then steps 2-3 can be safely skipped.

hova
Brings to mind:Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots. So far, the Universe is winning.-- Rick Cook, The Wizardry Compiled
marc_s
That is what I've done in one of our web applications and it works very well.
WakeUpScreaming