After searching stackoverflow.com I found several questions asking how to remove duplicates, but none of them addressed speed.
In my case I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million other rows with duplicates in 9 of the 10 columns. My current technique has taken 3 hours so far to delete these 5 million rows, and it is still running. Here is my process:
    -- Step 1: **This step took 13 minutes.** Insert only one of the n duplicate rows into a temp table
    SELECT
        MAX(prikey) AS MaxPriKey, -- identity(1, 1)
        a,
        b,
        c,
        d,
        e,
        f,
        g,
        h,
        i
    INTO #dupTemp
    FROM sourceTable
    GROUP BY
        a,
        b,
        c,
        d,
        e,
        f,
        g,
        h,
        i
    HAVING COUNT(*) > 1
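Before running the delete, a quick count can confirm how many rows step 2 is expected to remove. This is only a sketch, assuming the same sourceTable and columns a through i as in step 1; the names grp, cnt and rowsToDelete are made up for illustration:

```sql
-- Sketch: count the surplus copies (each duplicate group keeps exactly one row).
SELECT SUM(grp.cnt - 1) AS rowsToDelete
FROM (
    SELECT COUNT(*) AS cnt
    FROM sourceTable
    GROUP BY a, b, c, d, e, f, g, h, i
    HAVING COUNT(*) > 1
) AS grp;
```

If this number is far from the 5 million expected, the grouping columns are probably not the right definition of "duplicate".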
Next,
    -- Step 2: **This step is taking the 3+ hours**
    -- delete the row when all the non-unique columns are the same (duplicates) and
    -- have a smaller prikey not equal to the max prikey
    DELETE
    FROM sourceTable
    FROM sourceTable
    INNER JOIN #dupTemp ON
        sourceTable.a = #dupTemp.a AND
        sourceTable.b = #dupTemp.b AND
        sourceTable.c = #dupTemp.c AND
        sourceTable.d = #dupTemp.d AND
        sourceTable.e = #dupTemp.e AND
        sourceTable.f = #dupTemp.f AND
        sourceTable.g = #dupTemp.g AND
        sourceTable.h = #dupTemp.h AND
        sourceTable.i = #dupTemp.i AND
        sourceTable.PriKey != #dupTemp.MaxPriKey
Any tips on how to speed this up, or a faster way? Remember I will have to run this again for rows that are not exact duplicates.
Thanks so much.
UPDATE:
I had to stop step 2 at the 9-hour mark.
I tried OMG Ponies' method and it finished after only 40 minutes.
I also tried my step 2 with Andomar's batch delete; it ran for 9 hours before I stopped it.
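For reference, a batched version of step 2 is commonly written as a loop around DELETE TOP. The following is a minimal sketch of that general pattern, not necessarily the exact form suggested in the answer; the batch size of 10000 and the aliases s and d are arbitrary assumptions:

```sql
-- Sketch: delete in small batches so each transaction stays short.
DECLARE @batch INT;
SET @batch = 10000;  -- arbitrary size, tune for your system

WHILE 1 = 1
BEGIN
    DELETE TOP (@batch) s
    FROM sourceTable AS s
    INNER JOIN #dupTemp AS d ON
        s.a = d.a AND s.b = d.b AND s.c = d.c AND
        s.d = d.d AND s.e = d.e AND s.f = d.f AND
        s.g = d.g AND s.h = d.h AND s.i = d.i AND
        s.prikey <> d.MaxPriKey;

    -- Stop once a pass deletes nothing.
    IF @@ROWCOUNT = 0 BREAK;
END
```

Batching mainly limits the size of each transaction; every pass still performs the same nine-column join, so it may not be faster overall. Like the original step 2, the equality join does not match rows where any of a through i is NULL under the default ANSI_NULLS setting.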
UPDATE:
I ran a similar query with one fewer column to get rid of a different set of duplicates, and it took only 4 minutes (8000 rows) using OMG Ponies' method.
I will try the CTE technique the next chance I get; however, I suspect OMG Ponies' method will be tough to beat.
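For anyone reading along, the CTE approach is usually written with ROW_NUMBER(). This is a minimal sketch under a few assumptions: SQL Server 2005 or later, the same sourceTable, prikey and columns a through i as above, and the names ranked and rn chosen only for illustration:

```sql
-- Sketch: number the rows inside each duplicate group, newest prikey first,
-- then delete everything except row 1 of each group.
WITH ranked AS (
    SELECT
        ROW_NUMBER() OVER (
            PARTITION BY a, b, c, d, e, f, g, h, i
            ORDER BY prikey DESC
        ) AS rn
    FROM sourceTable
)
DELETE FROM ranked
WHERE rn > 1;
```

This keeps the row with the highest prikey in each group, matching what step 1 and step 2 do together, and it needs no temp table. Unlike the equality join in step 2, PARTITION BY treats NULLs as equal, so rows with NULLs in the compared columns are grouped as duplicates too.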