I have a DB table that consists of 2.5 billion records. There are duplicates to the tune of 11 million. What is the fastest way to delete these 11 million records?

+1  A: 

First, put an index on the column or columns that define and contain the duplicate values.
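
For example (a sketch only; Table stands in for the real table name, as in the pseudocode below, and the index name is a placeholder):

   Create Index ix_dupe_cols On Table (ColA, ColB);  -- one index covering every column that defines a duplicate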

Then, assuming the table has a primary key (PK):

  Delete From Table T Where PK <>
        (Select Min(PK) From Table
         Where ColA = T.ColA
           -- ... and so on, for each column in the set defined above
           And ColB = T.ColB)

NOTE: you could also use Max(PK); all you're doing is identifying a single record in each set of duplicates that will not be deleted.

EDIT: To avoid extensive use of the transaction log and the UNDO partition, you could store the duplicated values in a temp table, and then delete the dupes for each value in its own small transaction...

Assuming only one column (call it ColA, a number) defines the dupes...

   Create Table Dupes (ColA Number);

   Insert Into Dupes (ColA)
   Select ColA
   From Table
   Group By ColA
   Having Count(*) > 1;

   Declare
      recordExists Number := 0;
      ColAValue    Number;
   Begin
      Select Case When Exists (Select 1 From Dupes)
             Then 1 Else 0 End
        Into recordExists From Dual;

      While recordExists = 1
      Loop
         Select Max(ColA) Into ColAValue From Dupes;

         Delete From Table T
          Where ColA = ColAValue
            And PK <> (Select Min(PK) From Table
                       Where ColA = ColAValue);
         Delete From Dupes Where ColA = ColAValue;
         Commit;  -- one small transaction per duplicate value

         Select Case When Exists (Select 1 From Dupes)
                Then 1 Else 0 End
           Into recordExists From Dual;
      End Loop;
   End;

Not tested, so the syntax may need massaging...

Charles Bretana
A: 

If you're sure you won't compromise the integrity of the data (referential integrity), disable the constraints and indexes, perform the delete, then re-enable everything. You'll have to try it first, to see whether rebuilding the indexes when re-enabling them is less time-consuming than running the delete with them in place.

Some query optimization might also help, but without knowing more details, this is purely theoretical.
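
A rough sketch of that sequence (the table, constraint, and index names below are only placeholders, and the delete itself is whichever statement you settle on from the other answers):

   ALTER TABLE mytable DISABLE CONSTRAINT some_constraint;  -- repeat for each constraint that applies
   ALTER INDEX ix_mytable_dupcol UNUSABLE;                   -- skip index maintenance during the delete
   ALTER SESSION SET skip_unusable_indexes = TRUE;

   -- ... run the duplicate delete here ...

   ALTER INDEX ix_mytable_dupcol REBUILD;                    -- one rebuild instead of row-by-row maintenance
   ALTER TABLE mytable ENABLE CONSTRAINT some_constraint;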

Cătălin Pitiș
Don't drop the index on the columns you are using to find duplicates; doing repeated full table scans of 2,500,000,000 rows will be very, very, very slow.
Richard
It won't do repeated table scans; it will do hash semi joins if there are no indexes.
Quassnoi
+2  A: 
DELETE
FROM    mytable
WHERE   rowid IN
        (
        SELECT  rid
        FROM    (
                SELECT  rowid rid,
                        ROW_NUMBER() OVER (PARTITION BY dupfield ORDER BY rowid) rn
                FROM    mytable
                )
        WHERE   rn > 1
        )

or maybe even this:

DELETE
FROM    mytable mo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable mi
        WHERE   mi.dup_field = mo.dup_field
                AND mi.rowid < mo.rowid
        )

Both of these queries will use a quite efficient HASH SEMI JOIN; the latter will be faster if there is no index on dup_field.

You may be tempted to copy the rows, but note that much more REDO and UNDO information will be generated when copying 2G rows than when deleting 11M.
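
If you want to see which plan the optimizer actually picks before turning either statement loose on the full table, one way (a sketch, using the same placeholder names as above) is:

EXPLAIN PLAN FOR
DELETE
FROM    mytable mo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable mi
        WHERE   mi.dup_field = mo.dup_field
                AND mi.rowid < mo.rowid
        );

SELECT  *
FROM    TABLE(DBMS_XPLAN.DISPLAY);  -- look for HASH JOIN SEMI in the output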

Quassnoi
how's the performance of an update like this when the table size is 2.5 billion?
Mr. Shiny and New
I have the feeling that this query is dog slow, but could achieve what the OP needs. Can this be rewritten as a join?
Manuel Ferreria
There will be a sort on `dupfield` (if there is no index on it), which can take long. The join on `rowid` will be a `HASH SEMI JOIN`, which is a matter of minutes on `2G` against `11M` rows. Delete itself will also take dozens of minutes, mainly to generate `REDO` and `UNDO`.
Quassnoi
@Manuel: `Oracle` is smart enough to rewrite it as even more efficient `HASH SEMI JOIN`.
Quassnoi
+16  A: 

Deleting one duplicate from many is a tricky business, and with that many records, you have a problem.

One option is to turn the problem on its head, and copy the records you want to keep into a new table. You can use the CREATE TABLE ... NOLOGGING AS SELECT DISTINCT ... syntax, which will copy your de-duplicated records without using the transaction log, which is much faster. Once your new table is populated, drop or rename the old one, and rename the new one into place.

See http://www.databasejournal.com/features/oracle/article.php/3631361/Managing-Tables-Logging-versus-Nologging.htm

Oh, and remember to slap a UNIQUE index on the new table so this doesn't happen again.

The moral of the story is... never use DELETE for getting rid of large numbers of records; it's horrifyingly slow because it has to store all of the deleted records in the redo log. Either copy-and-switch, or TRUNCATE.
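
A rough sketch of the copy-and-switch (the _dedup and _old names are placeholders, and it assumes the duplicates are exact row-for-row copies; otherwise select one row per key with GROUP BY or ROW_NUMBER() instead of DISTINCT):

    CREATE TABLE mytable_dedup NOLOGGING AS
    SELECT DISTINCT *
    FROM   mytable;

    -- re-create indexes, constraints and grants on mytable_dedup here,
    -- including the UNIQUE index mentioned above, then swap the names:
    ALTER TABLE mytable       RENAME TO mytable_old;
    ALTER TABLE mytable_dedup RENAME TO mytable;

    -- keep mytable_old around until the data has been verified, then:
    DROP TABLE mytable_old;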

skaffman
... and you could apply the same algorithm to the team who produced the product which allowed only 11,000,000 duplicate rows ;-) Keith.
corlettk
+1 for this answer. I'd definitely be tempted to create a new copy of the table and insert across into that. The key thing I'd add is: don't put any indexes on that secondary table until you've copied the data over - you don't want the unnecessary hit of it having to keep the indexes up to date while inserting the data. I also like this approach as it has an extra safety net - you don't have to get rid of the old table until you're 100% sure you've got all the right data over.
AdaTheDev
It would be interesting to compare the time it takes to copy 2.489 billion records vs. deleting 11 million, using the same predicates.
Charles Bretana
Yes, this is a pretty extreme example. It would depend a lot on the disk throughput. I still reckon it'll be much faster, though.
skaffman
`CREATE TABLE NOLOGGING AS SELECT` will use the transaction log just like any other `DML` operation. The `NOLOGGING` part applies only to `CREATE TABLE` itself. Populating the table will require `2G` worth of redo, which is *much* more than for the `11M` deleted records.
Quassnoi
@Quassnoi: That's incorrect. See http://www.databasejournal.com/features/oracle/article.php/3631361/Managing-Tables-Logging-versus-Nologging.htm
skaffman
@skaffman: the time difference for the queries from the article is most probably due to the cache warmup by the previous queries. If you try to reproduce the queries with something like `CREATE TABLE mytable (id, name) NOLOGGING AS SELECT level, RPAD('a', 100, 'a') FROM dual CONNECT BY level <= 5000000`, you will see no performance difference between `LOGGING` and `NOLOGGING`. Both of these operations will generate a significant amount of `UNDO` (and hence `REDO`) on the data dictionary, which, combined with the time to write the data, will most probably be less efficient than just a `DELETE`.
Quassnoi
I have to disagree again. I've seen this in action, and it works.
skaffman
@skaffman, if by "it works" you mean that it is faster, then making a reference to seeing it "in action" detracts from your case, as (you should know) query performance "in action" is not dependent only on the query, but can vary widely depending on the state of the server (previous queries, cached data, etc.). So without extensive metrics and qualifiers to remove the influence of those other factors, all you are presenting is subjective impressions...
Charles Bretana
I'm sorry if you consider experience of an issue to be a detractor. However, I've already posted independent metrics to support my case. I would have thought the combination was compelling, but feel free to rely instead on gut feeling and hyperbole.
skaffman
@skaffman, sorry I raised your hackles, didn't intend that... please wait to get defensive until someone attacks you ... I'm not. I'm just disagreeing with the conclusion you're reaching from the "experience" you relate. Unless you've done exactly what this individual is trying to do, using both options, more than a statistically significant number of times (and I can't say that I have either), then your "experience" is just anecdotal.
Charles Bretana
I would disagree with, or amend, your moral of "never use DELETE for getting rid of large numbers of records" without defining what "large" is. 11 million may not be large if it's only 0.5% of the total data in the table, as we have here. It's impossible to tell without knowing more about the table structure and the conditions under which the removal needs to be performed. See my answer for an elaboration.
Steve Broberg
@Steve, exactly!
Charles Bretana
+1  A: 

Whether it is faster to delete the existing rows or to create a proper new table and drop the old one depends on a lot of factors. 11 million rows is a lot, but it's only 0.5% of the total number of rows in the table. It's quite possible that the recreate & drop could be much slower than the delete, depending on how many indexes exist on the source table, as well as where the rows that need deleting exist on the data pages.

Then there's the issue of whether the source table is live or not. If there are inserts & updates going on while this cleanup is occurring, the copy & drop isn't going to work without a fair amount of extra code to sync the table up after the fact.

Finally, why is it necessary that this operation be "fast"? Is it because the system needs to be offline while the process is occurring? You could write a procedure that removes the dupes while the system is live, but doesn't impact the rest of the system in terms of consuming undo. We have solved this problem in the past by first writing a query that collects the primary keys of the rows to be removed in a second table, like so:

  INSERT
    INTO RowsToDeleteTable
  SELECT PKColumn
    FROM SourceTable
   WHERE <conditions used to find rows to remove>;

CREATE UNIQUE INDEX PK_RowsToDelete ON RowsToDeleteTable (PKColumn);

Then we have a PL/SQL block that either loops over the rows in a cursor like so:

BEGIN
  FOR theRow IN (SELECT PKColumn FROM RowsToDeleteTable ORDER BY 1) LOOP
    <delete source table row for theRow.PKColumn>
    <optionally wait a bit>
    COMMIT;
  END LOOP;
END;

or does something like this:

DECLARE
  minPK SourceTable.PKColumn%TYPE;
BEGIN
  LOOP
    SELECT MIN(PKColumn) INTO minPK FROM RowsToDeleteTable;
    EXIT WHEN minPK IS NULL;   -- stop once every flagged row has been processed
    <delete source table row for minPK>
    <optionally wait a bit>
    DELETE RowsToDeleteTable
     WHERE PKColumn = minPK;
    COMMIT;
  END LOOP;
END;

The looping and "SELECT MIN" approach is obviously less efficient, but it has the advantage of allowing you to follow the progress of the delete operation. We put a bit of wait code in the loop to allow us to control how vigorously the reaping operation occurs.
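
The "<optionally wait a bit>" step can be as simple as a short sleep; for example (assuming you have execute privileges on the DBMS_LOCK package):

    DBMS_LOCK.SLEEP(0.5);  -- pause half a second per batch to throttle the reaper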

The initial creation of the RowsToDeleteTable goes very quickly, and you have the advantage of allowing the process to take as long as you want. In a case like this, the "holes" left in your extents by the deletes won't be too bad, since you're deleting such a small percentage of the total data.

Steve Broberg
A: 

Thanks guys! I had to copy the unique records into a table, truncate the original table and copy back the unique data.
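
A sketch of that sequence (tmp_unique and mytable are placeholder names, and it assumes the duplicates are exact row-for-row copies):

    CREATE TABLE tmp_unique NOLOGGING AS
    SELECT DISTINCT * FROM mytable;

    TRUNCATE TABLE mytable;

    INSERT /*+ APPEND */ INTO mytable
    SELECT * FROM tmp_unique;
    COMMIT;

    DROP TABLE tmp_unique;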

Chattz