ansaurus

Question

Fastest technique for deleting duplicate data

Answer 1

+1 A:

What about EXISTS:

DELETE FROM sourceTable
 WHERE EXISTS(SELECT NULL
                FROM #dupTemp dt
               WHERE sourceTable.a = dt.a 
                 AND sourceTable.b = dt.b 
                 AND sourceTable.c = dt.c 
                 AND sourceTable.d = dt.d 
                 AND sourceTable.e = dt.e 
                 AND sourceTable.f = dt.f 
                 AND sourceTable.g = dt.g 
                 AND sourceTable.h = dt.h 
                 AND sourceTable.i = dt.i 
                 AND sourceTable.PriKey < dt.MaxPriKey)
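The statement above relies on #dupTemp already existing. The question's own setup isn't shown here, so the following is only a sketch under the assumption that #dupTemp holds one row per duplicated (a..i) group plus the highest PriKey of that group (the row to keep). It also previews the row count before anything is deleted. Note that the `=` comparisons never match rows where one of the compared columns is NULL, so such groups would be left untouched.

```sql
-- Assumed shape of #dupTemp: one row per duplicated group,
-- carrying the highest PriKey, which is the row that survives.
SELECT a, b, c, d, e, f, g, h, i,
       MAX(PriKey) AS MaxPriKey
  INTO #dupTemp
  FROM sourceTable
 GROUP BY a, b, c, d, e, f, g, h, i
HAVING COUNT(*) > 1;

-- Dry run: count the rows the DELETE would remove.
SELECT COUNT(*) AS RowsToDelete
  FROM sourceTable st
 WHERE EXISTS (SELECT NULL
                 FROM #dupTemp dt
                WHERE st.a = dt.a
                  AND st.b = dt.b
                  AND st.c = dt.c
                  AND st.d = dt.d
                  AND st.e = dt.e
                  AND st.f = dt.f
                  AND st.g = dt.g
                  AND st.h = dt.h
                  AND st.i = dt.i
                  AND st.PriKey < dt.MaxPriKey);
```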

OMG Ponies 2010-08-17 22:01:59

Please explain why you think this way would be faster.

subt13 2010-08-17 22:13:44

OMG Ponies 2010-08-17 22:16:18

Do all of the columns within EXISTS() need to be non-null?

subt13 2010-08-17 22:31:38

@subt13: No, but if NULL is present in a column of the data that you're removing, that'd be good to know, for performance's sake only.

OMG Ponies 2010-08-17 22:34:21

Could this be combined with Andomar's batch delete answer and a_horse_with_no_name's real table vs temp table answer?

subt13 2010-08-17 22:46:08

Answer 2

A:

Well, lots of different things. First, would something like this work (do a select to make sure, maybe even put the results into a temp table of their own, #recordsToDelete):

delete sourceTable
from sourceTable
left join #dupTemp on
       sourceTable.PriKey = #dupTemp.MaxPriKey
where #dupTemp.MaxPriKey is null

Next, you can index temp tables: put an index on prikey.

If you have records in a temp table of the ones you want to delete, you can delete in batches which is often faster than locking up the whole table with a delete.
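A rough sketch of the staging idea, assuming #dupTemp has the columns used in the other answers (a..i and MaxPriKey). The index name is made up for illustration. Once only the keys are staged and indexed, the delete itself can join on PriKey alone instead of comparing nine columns.

```sql
-- Stage just the keys of the rows that should go.
SELECT st.PriKey
  INTO #recordsToDelete
  FROM sourceTable st
  JOIN #dupTemp dt
    ON st.a = dt.a
   AND st.b = dt.b
   AND st.c = dt.c
   AND st.d = dt.d
   AND st.e = dt.e
   AND st.f = dt.f
   AND st.g = dt.g
   AND st.h = dt.h
   AND st.i = dt.i
   AND st.PriKey < dt.MaxPriKey;

-- Temp tables can be indexed like regular tables.
-- IX_recordsToDelete_PriKey is a hypothetical name.
CREATE CLUSTERED INDEX IX_recordsToDelete_PriKey
    ON #recordsToDelete (PriKey);

-- Check the count before deleting anything.
SELECT COUNT(*) AS RowsToDelete
  FROM #recordsToDelete;
```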

HLGEM 2010-08-17 22:04:50

When dealing with non-null columns, `NOT IN` and `NOT EXISTS` are more efficient: http://explainextended.com/2009/09/15/not-in-vs-not-exists-vs-left-join-is-null-sql-server/

OMG Ponies 2010-08-17 22:11:10

Answer 3

+1 A:

The bottleneck in bulk row deletion is usually the transaction that SQL Server has to build up. You might be able to speed it up considerably by splitting the removal into smaller transactions. For example, to delete 100 rows at a time:

while 1=1
    begin

    delete top (100)
    from sourceTable 
    ...

    if @@rowcount = 0
        break
    end
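For reference, here is one way the loop could look when filled in. This is only a sketch: it borrows the EXISTS predicate and the #dupTemp table from the first answer, and the batch size is an arbitrary starting point to tune. Outside an explicit transaction each DELETE commits on its own, which is what keeps the individual transactions small.

```sql
-- Batch size is a guess; measure and adjust.
DECLARE @BatchSize int = 10000;

WHILE 1 = 1
BEGIN
    -- TOP in a DELETE needs its argument in parentheses.
    DELETE TOP (@BatchSize) st
      FROM sourceTable st
     WHERE EXISTS (SELECT NULL
                     FROM #dupTemp dt
                    WHERE st.a = dt.a
                      AND st.b = dt.b
                      AND st.c = dt.c
                      AND st.d = dt.d
                      AND st.e = dt.e
                      AND st.f = dt.f
                      AND st.g = dt.g
                      AND st.h = dt.h
                      AND st.i = dt.i
                      AND st.PriKey < dt.MaxPriKey);

    -- Stop once a pass removes nothing.
    IF @@ROWCOUNT = 0
        BREAK;
END;
```

Without an ORDER BY, TOP picks an arbitrary set of matching rows for each pass, which is fine here because every matching row is removed eventually.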

Andomar 2010-08-17 22:10:04

That's a very interesting idea. I will for sure try this.

subt13 2010-08-17 22:13:10

BTW: I don't think delete top 100 is valid syntax

subt13 2010-08-17 22:59:31

@subt13: It is - see [SQL Server 2008 BOL - DELETE](http://msdn.microsoft.com/en-us/library/ms189835.aspx)

OMG Ponies 2010-08-17 23:11:12

Answer 4

+2 A:

Can you afford to have the original table unavailable for a short time?

I think the fastest solution is to create a new table without the duplicates. Basically the approach that you use with the temp table, but creating a "regular" table instead.

Then drop the original table and rename the intermediate table to have the same name as the old table.
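A minimal sketch of that sequence, assuming the table consists of exactly the columns mentioned in this thread (PriKey and a..i); sourceTable_dedup is a made-up name for the intermediate table. SELECT ... INTO does not copy indexes, keys, constraints, triggers, or permissions, so those would have to be recreated afterwards.

```sql
-- Copy one row per (a..i) group, keeping the highest PriKey.
SELECT PriKey, a, b, c, d, e, f, g, h, i
  INTO sourceTable_dedup          -- hypothetical table name
  FROM (SELECT PriKey, a, b, c, d, e, f, g, h, i,
               ROW_NUMBER() OVER (PARTITION BY a, b, c, d, e, f, g, h, i
                                  ORDER BY PriKey DESC) AS rn
          FROM sourceTable) ranked
 WHERE rn = 1;

-- Swap the tables; the original is unavailable only for this step.
BEGIN TRANSACTION;
    DROP TABLE sourceTable;
    EXEC sp_rename 'sourceTable_dedup', 'sourceTable';
COMMIT TRANSACTION;
```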

a_horse_with_no_name 2010-08-17 22:15:46

Yes. Is a regular table faster than a temp table or something? Please excuse my ignorance :)

subt13 2010-08-17 22:25:39

Probably going to be the quickest solution proposed thus far - if there are foreign keys etc. this gets painful and prone to error if you're not careful, but definitely worth consideration.

Will A 2010-08-17 22:27:11

@subt13: you need the regular table because you are going to keep it ;) (in contrast to your temp table)

@WillA: yes, you are right, one needs to be careful with constraints.

a_horse_with_no_name 2010-08-17 22:32:00

I see. I don't have to worry about constraints or foreign keys yet.

subt13 2010-08-17 22:42:47

Answer 5

A:

Here's a version where you can combine both steps into a single step.

WITH cte AS
    ( SELECT prikey, ROW_NUMBER() OVER (PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY
        prikey DESC) AS sequence
    FROM sourceTable
    )

DELETE
FROM sourceTable
WHERE prikey IN
    ( SELECT prikey
    FROM cte
    WHERE sequence > 1
    ) ;

By the way, do you have any indexes that can be temporarily removed?
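If there are nonclustered indexes, something along these lines might help, since every deleted row otherwise has to be removed from each of them as well. The index name below is hypothetical. Only nonclustered indexes are candidates: disabling the clustered index makes the table inaccessible.

```sql
-- IX_sourceTable_a is a hypothetical nonclustered index.
ALTER INDEX IX_sourceTable_a ON sourceTable DISABLE;

-- ... run the duplicate removal here ...

-- Rebuilding re-enables the index.
ALTER INDEX IX_sourceTable_a ON sourceTable REBUILD;
```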

bobs 2010-08-17 22:16:12

Martin Smith showed the other day that the CTE can be referenced as the DELETE source, functioning like an updateable view.

OMG Ponies 2010-08-17 22:17:44

Ya, this is a cool feature; I just wasn't sure about its efficiency compared to an old-fashioned #temp table. It takes a while to do anything on this many rows. I have a clustered index. If more are needed I can certainly add them.

subt13 2010-08-17 22:24:33

Answer 6

+1 A:

...based on OMG Ponies' comment above, a CTE method that's a little more compact. This method works wonders on tables where you've (for whatever reason) no primary key - where you can have rows which are identical on all columns.

;WITH cte AS (
 SELECT ROW_NUMBER() OVER 
          (PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY prikey DESC) AS sequence
    FROM sourceTable
)
DELETE
FROM cte
WHERE sequence > 1

Will A 2010-08-17 22:23:50

Cool. I thought I was helping out, and I end up getting helped. This is a better performer than my suggestion.

bobs 2010-08-17 22:35:51

This is very compact, but I'm more interested in speed. From what I've read and seen with ctes, they are merely syntactical sugar in my case. Please correct me if I'm wrong, however.

subt13 2010-08-17 22:49:15

@subt13: You'll have to let us know after comparing the actual query plan between the various options.
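One way to gather numbers for that comparison, sketched here with the CTE delete from this answer as the candidate. The rollback is only there so the same data can be reused for the next candidate; on a very large table the rollback is itself expensive, so a copy of the data may be a better test bed. In Management Studio, "Include Actual Execution Plan" shows the plan alongside these figures.

```sql
-- Report reads and CPU/elapsed time per statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

BEGIN TRANSACTION;

WITH cte AS (
    SELECT ROW_NUMBER() OVER
             (PARTITION BY a, b, c, d, e, f, g, h, i
              ORDER BY prikey DESC) AS sequence
      FROM sourceTable
)
DELETE
  FROM cte
 WHERE sequence > 1;

-- Undo the delete so the next candidate sees the same rows.
ROLLBACK TRANSACTION;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```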

OMG Ponies 2010-08-17 23:12:08
