views:

984

answers:

2

Does anyone have an elegant sql statement to delete duplicate records from a table, but only if there are more than x number of duplicates? So it allows up to 2 or 3 duplicates, but that's it?

Currently I have a select statement that does the following:

delete table
from table t
left outer join (
 select max(id) as rowid, dupcol1, dupcol2
 from table
 group by dupcol1, dupcol2
) as keeprows on t.id=keeprows.rowid
where keeprows.rowid is null

This works great. But now what I'd like to do is only delete those rows if they have more than say 2 duplicates.

Thanks

+2  A: 

HAVING is your friend

select id, count(*) cnt from table group by id having cnt>2

SquareCog
+4  A: 
with cte as (
  select row_number() over (partition by dupcol1, dupcol2 order by ID) as rn
     from table)
delete from cte
   where rn > 2; -- or >3 etc

The query is manufacturing a 'row number' for each record, grouped by the (dupcol1, dupcol2) and ordered by ID. In effect this row number counts 'duplicates' that have the same dupcol1 and dupcol2 and assigns then the number 1, 2, 3.. N, order by ID. If you want to keep just 2 'duplicates', then you need to delete those that were assigned the numbers 3,4,.. N and that is the part taken care of by the DELLETE.. WHERE rn > 2;

Using this method you can change the ORDER BY to suit your preferred order (eg. ORDER BY ID DESC), so that the LATEST has rn=1, then the next to latest is rn=2 and so on. The rest stays the same, the DELETE will remove only the oldest ones as they have the highest row numbers.

Unlike this closely related question, as the condition becomes more complex, using CTEs and row_number() becomes simpler. Performance may be problematic still if no proper access index exists.

Remus Rusanu
Thanks Remus, but as I'm not an sql expert and not as familiar w/ 2005 specific keywords, could you explain to me what the query is doing?I presume partition is a nice shortcut for left joining to a grouped table, similar to my first example?? So your second line is returning a new id of all duplicate records based on the columns provided?Is rn the number of times that the row had been duplicated based on the columns in the second line?Thanks.
Scott
The query is manufacturing a 'row number' for each record, grouped by the (dupcol1, dupcol2) and ordered by ID. In effect this row number counts 'duplicates' that have the same dupcol1 and dupcol2 and assigns then the number 1, 2, 3.. N, order by ID. If you want to keep just 2 'duplicates', then you need to delete those that were assigned the numbers 3,4,.. N and that is the part taken care of by the DELLETE.. WHERE rn > 2; HTH, let me know if is still unclear.
Remus Rusanu
Nope, I gotcha, thanks a lot. One last thing though, I want to ensure that I'm always keeping the LATEST record. So, if I keep records with say <2 duplicates, and then throw out all the others, how can I modify the query to ensure that the most recent (max(id)) two or three records of the table are preserved.As an example: say one record is listed in our system 10 times. This violates the "2" duplicate rule. We'd like to remove 7 of the duplicates, leaving only one master record and two duplicates. By master record, we mean the LATEST (most up to date) record that went into the system.
Scott
you change the ORDER BY to suit your preferred order (eg. ORDER BY ID DESC), so that the LATEST has rn=1, then the next to latest is rn=2 and so on. The rest stays the same, the DELETE will remove only the oldest ones as they have the highest row numbers.
Remus Rusanu
Thanks again Remus, this is very helpful!
Scott
@Remus - your explanations really added to the answer so I popped them up in there.
Rob Allen