views:

247

answers:

7

I tried to delete duplicate records in my DB. The only difference is the PrimaryKey which is a uniqueidentifier. I have about 1500 entries that have been duped so in all I'm looking at around 3000 entries. So I sectioned off about 60 entries (based on the receivedOn date) and executed my code to pare them down to 30 and OH CRAP the 30 disappeared! Here is the code I tried:

DELETE dupes
FROM [emailTable] dupes, [emailTable] fullTable
WHERE (dupes.ReceivedOn > '2009-08-18 23:59:59.999' AND dupes.ReceivedOn < '2009-08-20 00:00:00.000')
      AND (dupes.emlPath = fullTable.emlPath)
      AND NOT (dupes.GUID = fullTable.GUID)

My goal is to delete the duplicate. I dont care which one... but I need ONE of the two entries to stay on the server... Can anybody shed some light on what I did wrong?

+2  A: 

You should do the delete with a subselect, not a join.

The benefit of doing it this way, is you can preview the GUID's you will delete before you actually delete them. (just run the select query by it self)

This outta do it, it will delete the smallest GUID

 delete from emailTable where GUID in
 (

  select MIN(dupe.GIUD) from emailTable dupe
    INNER JOIN emailTable noDupe 
   ON dupe.emlPath=noDupe.emlPath 
   where recievedOn between '2009-8-18' and '2009-8-20'
               GROUP BY dupe.emlPath
 )
Byron Whitlock
+1  A: 

What you did wrong is that your query doesn't exclude any of the duplicates. It picks out the duplicates that are different from another duplicate with the same path, but every duplicate is different from another duplicate.

What you have to do is to first pick out the duplicates that you want to keep, for example:

select min(GUID)
from emailTable
where ReceivedOn > '...' and ReceivedOn < '...'
group by emlPath
having count(*) > 1

Then you delete all duplicates except those.

Guffa
OMG Ponies
@rexem, beware of having a tie in the values being ranked. Use ROW_NUMBER() instead.
Jeff O
+4  A: 

You can do this without a second table. Something like this:

SELECT * FROM emailTable
WHERE EXISTS (
    SELECT * FROM emailTable AS t2
    WHERE t2.emlPath = emailTable.emlPath AND
    t2.GUID > emailTable.GUID)

That will show you which records are about to get deleted. If that's okay, change it to:

DELETE FROM emailTable
WHERE EXISTS (
    SELECT * FROM emailTable AS t2
    WHERE t2.emlPath = emailTable.emlPath AND
    t2.GUID > emailTable.GUID)

The t2.GUID > emailTable.GUID will make sure that one record with that emlPath will remain in the table.

Thorarin
A: 

You should not use "=" in your join. ie "AND NOT (dupes.GUID = fullTable.GUID)" This condition won't do anything since the GUID of your duplicate rows must be different.

You should use greater than. ie

delete from emailTable 
WHERE EXISTS
(
    SELECT ID FROM emailTable t2
    WHERE emailTable.GUID > t2.GUID
    AND emailTable.emlPath= t2.emlPath
)
David
A: 

I prefer to use a common table expression for this and ROW_NUMBER():

with cte as (
   select row_number() over (partition by emlPath order by GUID) as eml_no
      , ReceivedOn
   from [emailTables])
delete from cte
   where eml_no > 1
   and ReceivedOn between '2009-08-18 23:59:59.999' AND '2009-08-20 00:00:00.000';

I preffer this because it gives stirct control over which duplicate row is deleted. I can delete the third one and keep two, I can choose whatever order number I want to keep the first one, and it deals fine with ties.

Remus Rusanu
A: 

This was the code that I ended up at thanks to the help of all the posts:

DELETE A
  FROM [emailTable] A, [emailTable] B
  WHERE A.MessageID = B.MessageID
        AND A.GUID > B.GUID
swolff1978
A: 

i want to delete duplicate record but i have no primary key . just as example i have three columns (id ,name,address) and have three same record in a table such as(1,Bob,London).I want to delete all duplicate record but remains one record in a table.

Navkul