After searching stackoverflow.com, I found several questions asking how to remove duplicates, but none of them addressed speed.

In my case I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million other rows with duplicates in 9 of the 10 columns. My current technique is taking (so far) 3 hours to delete these 5 million rows. Here is my process:

-- Step 1:  **This step took 13 minutes.** Insert only one of the n duplicate rows into a temp table
select
    MAX(prikey) as MaxPriKey, -- identity(1, 1)
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
into #dupTemp
FROM sourceTable
group by
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
having COUNT(*) > 1

Next,

-- Step 2: **This step is taking the 3+ hours**
-- delete the row when all the non-unique columns are the same (duplicates) and
-- have a smaller prikey not equal to the max prikey
delete 
from sourceTable
from sourceTable
inner join #dupTemp on  
    sourceTable.a = #dupTemp.a and
    sourceTable.b = #dupTemp.b and
    sourceTable.c = #dupTemp.c and
    sourceTable.d = #dupTemp.d and
    sourceTable.e   = #dupTemp.e and
    sourceTable.f = #dupTemp.f and
    sourceTable.g = #dupTemp.g and
    sourceTable.h = #dupTemp.h and
    sourceTable.i   = #dupTemp.i and
    sourceTable.PriKey != #dupTemp.MaxPriKey  

Any tips on how to speed this up, or a faster way? Remember I will have to run this again for rows that are not exact duplicates.

Thanks so much.

UPDATE:
I had to stop step 2 from running at the 9-hour mark. I tried OMG Ponies' method and it finished after only 40 minutes. I also tried my step 2 with Andomar's batch delete; it had been running for 9 hours before I stopped it.

UPDATE: I ran a similar query with one less field to get rid of a different set of duplicates, and the query ran for only 4 minutes (8000 rows) using OMG Ponies' method.

I will try the CTE technique the next chance I get; however, I suspect OMG Ponies' method will be tough to beat.

+1  A: 

What about EXISTS:

DELETE FROM sourceTable
 WHERE EXISTS(SELECT NULL
                FROM #dupTemp dt
               WHERE sourceTable.a = dt.a 
                 AND sourceTable.b = dt.b 
                 AND sourceTable.c = dt.c 
                 AND sourceTable.d = dt.d 
                 AND sourceTable.e = dt.e 
                 AND sourceTable.f = dt.f 
                 AND sourceTable.g = dt.g 
                 AND sourceTable.h = dt.h 
                 AND sourceTable.i = dt.i 
                 AND sourceTable.PriKey < dt.MaxPriKey)
OMG Ponies
Please explain why you think this way would be faster.
subt13
OMG Ponies
Do all of the columns within EXISTS() need to be non-null?
subt13
@subt13: No, but if NULL is present in any of those columns in the data you're removing, that'd be good to know - for performance's sake only.
OMG Ponies
Could this be combined with Andomar's batch delete answer and a_horse_with_no_name's real table vs temp table answer?
subt13
A: 

Well, lots of different things. First, would something like this work (do a select to make sure, maybe even put it into a temp table of its own, #recordsToDelete):

delete sourceTable
from sourceTable
left join #dupTemp on
       sourceTable.PriKey = #dupTemp.MaxPriKey
where #dupTemp.MaxPriKey is null

Next, you can index temp tables; put an index on prikey.
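
For example (a sketch only; the index name is a placeholder, and which index helps depends on which join you run against #dupTemp):

-- A clustered index on MaxPriKey helps a PriKey = MaxPriKey join;
-- an index on the nine grouping columns would help the question's step 2 join instead.
create clustered index ix_dupTemp on #dupTemp (MaxPriKey)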

If you have the records you want to delete in a temp table, you can delete in batches, which is often faster than locking up the whole table with a single delete.
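
A minimal sketch of that, assuming a #recordsToDelete temp table holding one prikey per row to remove (both the table and the batch size here are placeholders):

while 1 = 1
begin
    -- delete in chunks so each transaction (and its locks/log usage) stays small
    delete top (10000) s
    from sourceTable s
    inner join #recordsToDelete r on r.prikey = s.prikey

    if @@rowcount = 0
        break
end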

HLGEM
When dealing with non-null columns, `NOT IN` and `NOT EXISTS` are more efficient: http://explainextended.com/2009/09/15/not-in-vs-not-exists-vs-left-join-is-null-sql-server/
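For reference, the NOT EXISTS form of that anti-join might look like this (a sketch mirroring the LEFT JOIN / IS NULL logic above, with the same "verify with a select first" caveat):

delete s
from sourceTable s
where not exists (select null
                  from #dupTemp dt
                  where dt.MaxPriKey = s.PriKey)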
OMG Ponies
+1  A: 

The bottleneck in bulk row deletion is usually the transaction that SQL Server has to build up. You might be able to speed it up considerably by splitting the removal into smaller transactions. For example, to delete 100 rows at a time:

while 1=1
    begin

    delete top (100)
    from sourceTable 
    ...

    if @@rowcount = 0
        break
    end
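
A sketch of what that might look like with the question's step 2 join filled in (not the answer's exact code; the batch size of 100000 is an arbitrary starting point, and DELETE requires parentheses around the TOP value):

while 1 = 1
begin
    -- s aliases sourceTable; join/predicates copied from the question's step 2
    delete top (100000) s
    from sourceTable s
    inner join #dupTemp on
        s.a = #dupTemp.a and
        s.b = #dupTemp.b and
        s.c = #dupTemp.c and
        s.d = #dupTemp.d and
        s.e = #dupTemp.e and
        s.f = #dupTemp.f and
        s.g = #dupTemp.g and
        s.h = #dupTemp.h and
        s.i = #dupTemp.i and
        s.PriKey != #dupTemp.MaxPriKey

    if @@rowcount = 0
        break
end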
Andomar
That's a very interesting idea. I will for sure try this.
subt13
BTW: I don't think delete top 100 is valid syntax
subt13
@subt13: It is - see [SQL Server 2008 BOL - DELETE](http://msdn.microsoft.com/en-us/library/ms189835.aspx)
OMG Ponies
+2  A: 

Can you afford to have the original table unavailable for a short time?

I think the fastest solution is to create a new table without the duplicates. Basically the approach that you use with the temp table, but creating a "regular" table instead.

Then drop the original table and rename the intermediate table to have the same name as the old table.
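
In outline, that might look something like this (a sketch only; sourceTable_clean is a placeholder name, the aggregate means SELECT ... INTO won't carry over the identity property, and indexes/constraints would need to be recreated before the rename):

-- keep one row per group of the nine non-key columns
select max(prikey) as prikey, a, b, c, d, e, f, g, h, i
into sourceTable_clean
from sourceTable
group by a, b, c, d, e, f, g, h, i

-- swap the tables
drop table sourceTable
exec sp_rename 'sourceTable_clean', 'sourceTable'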

a_horse_with_no_name
Yes. Is a regular table faster than a temp table or something? Please excuse my ignorance :)
subt13
Probably going to be the quickest solution proposed thus far - if there are foreign keys etc. this gets painful and prone to error if you're not careful, but definitely worth consideration.
Will A
@subt13: you need the regular table because you are going to keep it ;) (in contrast to your temp table). @WillA: yes, you are right, one needs to be careful with constraints.
a_horse_with_no_name
I see. I don't have to worry about constraints or foreign keys yet.
subt13
A: 

Here's a version where you can combine both steps into a single step.

WITH cte AS
    ( SELECT prikey, ROW_NUMBER() OVER (PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY
        prikey DESC) AS sequence
    FROM sourceTable
    )

DELETE
FROM sourceTable
WHERE prikey IN
    ( SELECT prikey
    FROM cte
    WHERE sequence > 1
    ) ;

By the way, do you have any indexes that can be temporarily removed?
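
If there are nonclustered indexes on sourceTable, one common pattern is to disable them for the bulk delete and rebuild afterwards (a sketch; ix_sourceTable_abc is a placeholder name):

-- disable the nonclustered index so the delete doesn't have to maintain it
alter index ix_sourceTable_abc on sourceTable disable

-- ... run the delete ...

alter index ix_sourceTable_abc on sourceTable rebuild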

bobs
Martin Smith showed the other day that the CTE can be referenced as the DELETE source, functioning like an updateable view.
OMG Ponies
Ya, this is a cool feature; I just wasn't sure about the efficiency compared to an old-fashioned #temp table. It takes a while to do anything on this many rows. I have a clustered index. If more are needed, I can certainly add them.
subt13
+1  A: 

...based on OMG Ponies' comment above, here's a CTE method that's a little more compact. This method works wonders on tables where (for whatever reason) you have no primary key - where you can have rows that are identical on all columns.

;WITH cte AS (
 SELECT ROW_NUMBER() OVER 
          (PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY prikey DESC) AS sequence
    FROM sourceTable
)
DELETE
FROM cte
WHERE sequence > 1
Will A
Cool. I thought I was helping out, and I end up getting helped. This is a better performer than my suggestion.
bobs
This is very compact, but I'm more interested in speed. From what I've read and seen with CTEs, they are merely syntactic sugar in my case. Please correct me if I'm wrong, however.
subt13
@subt13: You'll have to let us know after comparing the actual query plan between the various options.
OMG Ponies