+4  A: 

There is no practical threshold. It depends on what your command timeout is set to on your connection.

Keep in mind that the time it takes to delete all of these rows is contingent upon:

  • The time it takes to find the rows of interest
  • The time it takes to log the transaction in the transaction log
  • The time it takes to delete the index entries of interest
  • The time it takes to delete the actual rows of interest
  • The time it takes to wait for other processes to stop using the table so you can acquire what in this case will most likely be an exclusive table lock

The last point may often be the most significant. Run sp_who2 in another query window to make sure there isn't lock contention going on that is preventing your command from executing.
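
For example, run something like this in a second query window while the DELETE is executing; a non-blank BlkBy value means another session is holding a lock that is blocking yours:

-- check the BlkBy column for the SPID that is blocking the delete
EXEC sp_who2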

Improperly configured SQL Servers will do poorly at this type of query. Transaction logs that are too small and/or share the same disks as the data files will often incur severe performance penalties when deleting large numbers of rows.
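
If you want to check both of those things, a couple of standard commands will show them (run in the database in question; this is just a quick sanity check, not a tuning guide):

-- how full each database's transaction log currently is
DBCC SQLPERF(LOGSPACE)

-- where the data (.mdf) and log (.ldf) files for the current database live
SELECT name, type_desc, physical_name, size
FROM sys.database_files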

As for a solution, well, like all things, it depends. Is this something you intend to be doing often? Depending on how many rows you have left, the fastest way might be to rebuild the table under another name, then rename it and recreate its constraints, all inside a transaction. If this is just an ad-hoc thing, make sure your ADO CommandTimeout is set high enough, and you can just bear the cost of this big delete.
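
Here is a rough sketch of the rebuild-and-rename idea. The table name cats_keep and the commented-out constraint are made up for illustration; you would need to script out whatever indexes, constraints and permissions actually exist on cats and recreate them after the swap (and handle any foreign keys that reference cats):

BEGIN TRANSACTION

-- copy only the rows you want to keep into a new table
SELECT *
INTO cats_keep
FROM cats
WHERE cats.id_cat NOT IN (
    SELECT owner_cats.id_cat FROM owner_cats
    WHERE owner_cats.id_owner = 1)

-- swap the new table in under the old name
DROP TABLE cats
EXEC sp_rename 'cats_keep', 'cats'

-- recreate the original table's constraints and indexes, e.g.:
-- ALTER TABLE cats ADD CONSTRAINT PK_cats PRIMARY KEY (id_cat)

COMMIT TRANSACTION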

Dave Markle
Well, I shouldn't have much contention on this box. I'm not setting `CommandTimeout`, so I guess I'm using the default value of 30 seconds. Also, the .ldf shares the same disk as the .mdf, but I can probably change that. This is a batch load process, and that `DELETE` is only done when a web service call times out and I need to reload cats just for the owner I was currently loading.
JohnB
+2  A: 

There's no threshold as such - you can DELETE all the rows from any table given enough transaction log space - which is where your query is most likely falling over. If you're getting some results from your DELETE TOP (n) PERCENT FROM cats WHERE ... then you can wrap it in a loop as below:

SELECT 1  -- primes @@ROWCOUNT so the loop body runs at least once
WHILE @@ROWCOUNT <> 0
BEGIN
    DELETE TOP (somevalue) PERCENT FROM cats
    WHERE cats.id_cat IN (
        SELECT owner_cats.id_cat FROM owner_cats
        WHERE owner_cats.id_owner = 1)
END
Will A
+2  A: 

As others have mentioned, when you delete 42 million rows, the database has to log 42 million deletions. Thus, the transaction log has to grow substantially. What you might try is to break up the delete into chunks. In the following query, I use the NTile ranking function to break the rows into 100 buckets. If that is too slow, you can expand the number of buckets so that each delete is smaller. It will help tremendously if there is an index on owner_cats.id_owner, owner_cats.id_cat and cats.id_cat (which I assume is the primary key and numeric).

Declare @Cats Cursor
Declare @Grp int    --NTile group number
Declare @Start int  --lowest id_cat in the group (assuming an integer PK here)
Declare @End int    --highest id_cat in the group
Declare @GroupCount int

Set @GroupCount = 100

Set @Cats = Cursor Fast_Forward For
    With CatHerd As
        (
        Select cats.id_cat
            , NTile(@GroupCount) Over ( Order By cats.id_cat ) As Grp
        From cats
            Join owner_cats
                On owner_cats.id_cat = cats.id_cat
        Where owner_cats.id_owner = 1
        )
        Select Grp, Min(id_cat) As MinCat, Max(id_cat) As MaxCat
        From CatHerd
        Group By Grp
Open @Cats
Fetch Next From @Cats Into @Grp, @Start, @End

While @@Fetch_Status = 0
Begin
    Delete cats
    Where id_cat Between @Start And @End

    Fetch Next From @Cats Into @Grp, @Start, @End
End 

Close @Cats
Deallocate @Cats

The notable catch with the above approach is that it is not transactional. Thus, if it fails on the 40th chunk, you will have deleted 40% of the rows and the other 60% will still exist.

Thomas
Thanks, I might have to try this. But what do you think about my `TOP (25) PERCENT` idea?
JohnB
@John B - The downside of the TOP X% solution is that you have to requery/reevaluate TOP X% on each iteration instead of just once like I've done here.
Thomas
+3  A: 

If the delete will remove "a significant number" of rows from the table, this can be an alternative to a DELETE: put the records to keep somewhere else, truncate the original table, put back the 'keepers'. Something like:

SELECT *
INTO #cats_to_keep
FROM cats
WHERE cats.id_cat NOT IN (    -- note the NOT
SELECT owner_cats.id_cat FROM owner_cats
WHERE owner_cats.id_owner = 1)

TRUNCATE TABLE cats

INSERT INTO cats
SELECT * FROM #cats_to_keep
Patrick Marchand
Good suggestion!
JohnB
+1  A: 

If you use an EXISTS rather than an IN, you should get much better performance. Try this:

DELETE c
  FROM cats c
 WHERE EXISTS (SELECT 1
                 FROM owner_cats o
                WHERE o.id_cat = c.id_cat
                  AND o.id_owner = 1)
Tom
+1 it helps! With 42 million rows, still no indexes, my old way: 22:08 (min:sec). Your way: 13:21. However, with 13 million rows (I have 2 owners), my old way: 2:10. Your way: 2:13. Great tip though, can you explain how it works please?
JohnB
It's all about how the optimiser handles things really, but basically with the IN clause the sub-select must be fully evaluated, whereas with EXISTS only the first matching row is needed.
Tom
Without indexes you are always going to be stuffed here. At the very least you need to put an index on owner_cats.id_cat, then this EXISTS clause should be lightning fast.
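
Something along these lines (the index name is arbitrary; adding id_owner as a second key column also covers the id_owner = 1 filter in the EXISTS subquery):

CREATE INDEX IX_owner_cats_id_cat ON owner_cats (id_cat, id_owner)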
Tom
+1  A: 

Have you tried dropping the subquery and using a join instead?

DELETE c
FROM cats c
    INNER JOIN owner_cats oc
        ON c.id_cat = oc.id_cat
WHERE oc.id_owner = 1

And if you have, have you also tried different join hints, e.g.:

DELETE c
FROM cats c
    INNER HASH JOIN owner_cats oc
        ON c.id_cat = oc.id_cat
WHERE oc.id_owner = 1
Conrad Frix
+1 I have not, but I will now thanks!
JohnB
A: 

Bill Karwin's answer to another question applies to my situation also:

"If your DELETE is intended to eliminate a great majority of the rows in that table, one thing that people often do is copy just the rows you want to keep to a duplicate table, and then use DROP TABLE or TRUNCATE to wipe out the original table much more quickly."

Matt in this answer says it this way:

"If offline and deleting a large %, may make sense to just build a new table with data to keep, drop the old table, and rename."

ammoQ in this answer (from the same question) recommends (paraphrased):

  • issue a table lock when deleting a large number of rows (sketched below)
  • put indexes on any foreign key columns
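
A minimal sketch of the table-lock suggestion against the same schema (the foreign-key-index bullet is the same CREATE INDEX idea shown in the comments under Tom's EXISTS answer above):

-- take a table lock up front instead of escalating row locks partway through
DELETE FROM cats WITH (TABLOCK)
WHERE cats.id_cat IN (
    SELECT owner_cats.id_cat FROM owner_cats
    WHERE owner_cats.id_owner = 1)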
JohnB