I have a DB table that consists of 2.5 billion records. There are duplicates to the tune of 11 million. What is the fastest way to delete these 11 million records?

+1  A: 

First, put an index on the column or columns that define and contain the duplicate values.
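
For example (a sketch only; Table stands in for the real table name, as in the pseudocode below, and the index name is a placeholder):

   Create Index ix_dupe_cols On Table (ColA, ColB);  -- one index covering every column that defines a duplicate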

Then, assuming the table has a primary key (PK):

  Delete From Table T Where PK <>
        (Select Min(PK) From Table
         Where ColA = T.ColA
           -- ... and so on, for each column in the set defined above
           And ColB = T.ColB)

NOTE: you could also use Max(PK); all you're doing is identifying a single record in each set of duplicates that will not be deleted.

EDIT: To avoid extensive use of the transaction log and the UNDO partition, you could store the duplicated values in a temp table, and then delete the dupes for each value in its own small transaction...

Assuming only one column (call it ColA, a number) defines the dupes...

   Create Table Dupes (ColA Number);

   Insert Into Dupes (ColA)
   Select ColA
   From Table
   Group By ColA
   Having Count(*) > 1;

   Declare
      recordExists Number := 0;
      ColAValue    Number;
   Begin
      Select Case When Exists (Select 1 From Dupes)
             Then 1 Else 0 End
        Into recordExists From Dual;

      While recordExists = 1
      Loop
         Select Max(ColA) Into ColAValue From Dupes;

         Delete From Table T
          Where ColA = ColAValue
            And PK <> (Select Min(PK) From Table
                       Where ColA = ColAValue);
         Delete From Dupes Where ColA = ColAValue;
         Commit;  -- one small transaction per duplicate value

         Select Case When Exists (Select 1 From Dupes)
                Then 1 Else 0 End
           Into recordExists From Dual;
      End Loop;
   End;

Not tested, so the syntax may need massaging...

Charles Bretana
A: 

If you're sure you won't compromise the integrity of the data (referential integrity), disable the constraints and indexes, perform the delete, then re-enable everything. You'll have to try it first, to see whether rebuilding the indexes when re-enabling them is less time-consuming than running the delete with them in place.

Some query optimization might also help, but without knowing more details, this is purely theoretical.
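
A rough sketch of that sequence (the table, constraint, and index names below are only placeholders, and the delete itself is whichever statement you settle on from the other answers):

   ALTER TABLE mytable DISABLE CONSTRAINT some_constraint;  -- repeat for each constraint that applies
   ALTER INDEX ix_mytable_dupcol UNUSABLE;                   -- skip index maintenance during the delete
   ALTER SESSION SET skip_unusable_indexes = TRUE;

   -- ... run the duplicate delete here ...

   ALTER INDEX ix_mytable_dupcol REBUILD;                    -- one rebuild instead of row-by-row maintenance
   ALTER TABLE mytable ENABLE CONSTRAINT some_constraint;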

Cătălin Pitiș
Don't drop the index on the columns you are using to find duplicates; doing repeated full table scans of 2,500,000,000 rows will be very, very, very slow.
Richard
It won't do repeated table scans; it will do hash semi joins if there are no indexes.
Quassnoi
+2  A: 
DELETE
FROM    mytable
WHERE   rowid IN
        (
        SELECT  rid
        FROM    (
                SELECT  rowid rid,
                        ROW_NUMBER() OVER (PARTITION BY dupfield ORDER BY rowid) rn
                FROM    mytable
                )
        WHERE   rn > 1
        )

or maybe even this:

DELETE
FROM    mytable mo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable mi
        WHERE   mi.dup_field = mo.dup_field
                AND mi.rowid < mo.rowid
        )

Both of these queries will use a quite efficient HASH SEMI JOIN; the latter will be faster if there is no index on dup_field.

You may be tempted to copy the rows, but note that much more REDO and UNDO information will be generated when copying 2G rows than when deleting 11M.
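
If you want to see which plan the optimizer actually picks before turning either statement loose on the full table, one way (a sketch, using the same placeholder names as above) is:

EXPLAIN PLAN FOR
DELETE
FROM    mytable mo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable mi
        WHERE   mi.dup_field = mo.dup_field
                AND mi.rowid < mo.rowid
        );

SELECT  *
FROM    TABLE(DBMS_XPLAN.DISPLAY);  -- look for HASH JOIN SEMI in the output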

Quassnoi
how's the performance of an update like this when the table size is 2.5 billion?
Mr. Shiny and New
I have the feeling that this query is dog slow, but could achieve what the OP needs. Can this be rewritten as a join?
Manuel Ferreria
There will be a sort on `dupfield` (if there is no index on it), which can take long. The join on `rowid` will be a `HASH SEMI JOIN`, which is a matter of minutes on `2G` against `11M` rows. Delete itself will also take dozens of minutes, mainly to generate `REDO` and `UNDO`.
Quassnoi
@Manuel: `Oracle` is smart enough to rewrite it as even more efficient `HASH SEMI JOIN`.
Quassnoi
+16  A: 

Deleting one duplicate from many is a tricky business, and with that many records, you have a problem.

One option is to turn the problem on its head, and copy the records you want to keep into a new table. You can use the CREATE TABLE ... NOLOGGING AS SELECT DISTINCT ... syntax, which will copy your de-duplicated records without using the transaction log, which is much faster. Once your new table is populated, drop or rename the old one, and rename the new one into place.

See http://www.databasejournal.com/features/oracle/article.php/3631361/Managing-Tables-Logging-versus-Nologging.htm

Oh, and remember to slap a UNIQUE index on the new table so this doesn't happen again.

The moral of the story is... never use DELETE for getting rid of large numbers of records; it's horrifyingly slow because it has to store all of the deleted records in the redo log. Either copy-and-switch, or TRUNCATE.
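
A rough sketch of the copy-and-switch (the _dedup and _old names are placeholders, and it assumes the duplicates are exact row-for-row copies; otherwise select one row per key with GROUP BY or ROW_NUMBER() instead of DISTINCT):

    CREATE TABLE mytable_dedup NOLOGGING AS
    SELECT DISTINCT *
    FROM   mytable;

    -- re-create indexes, constraints and grants on mytable_dedup here,
    -- including the UNIQUE index mentioned above, then swap the names:
    ALTER TABLE mytable       RENAME TO mytable_old;
    ALTER TABLE mytable_dedup RENAME TO mytable;

    -- keep mytable_old around until the data has been verified, then:
    DROP TABLE mytable_old;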

skaffman
... and you could apply the same algorithm to the team who produced the product which allowed only 11,000,000 duplicate rows ;-) Keith.
corlettk
+1 for this answer. I'd definitely be tempted to create a new copy of the table and insert across into that. The key thing I'd add is: don't put any indexes on that secondary table until you've copied the data over - you don't want the unnecessary hit of it having to keep the indexes up to date while inserting the data. I also like this approach as it has an extra safety net - you don't have to get rid of the old table until you're 100% sure you've got all the right data over.
AdaTheDev
It would be interesting to compare the time it takes to copy 2.489 billion records vs. deleting 11 million, using the same predicates.
Charles Bretana
Yes, this is a pretty extreme example. It would depend a lot on the disk throughput. I still reckon it'll be much faster, though.
skaffman
`CREATE TABLE NOLOGGING AS SELECT` will use the transaction log just like any other `DML` operation. The `NOLOGGING` part applies only to `CREATE TABLE` itself. Populating the table will require `2G` worth of redo, which is *much* more than for the `11M` deleted records.
Quassnoi
@Quassnoi: That's incorrect. See http://www.databasejournal.com/features/oracle/article.php/3631361/Managing-Tables-Logging-versus-Nologging.htm
skaffman
@skaffman: the time difference for the queries from the article is most probably due to the cache warmup by the previous queries. If you try to reproduce the queries with something like `CREATE TABLE mytable (id, name) NOLOGGING AS SELECT level, RPAD('a', 100, 'a') FROM dual CONNECT BY level <= 5000000`, you will see no performance difference between `LOGGING` and `NOLOGGING`. Both of these operations will generate a significant amount of `UNDO` (and hence `REDO`) on the data dictionary, which, combined with the time to write the data, will most probably be less efficient than just a `DELETE`.
Quassnoi
I have to disagree again. I've seen this in action, and it works.
skaffman
@skaffman, if by "it works" you mean that it is faster, then making a reference to seeing it "in action" detracts from your case, as (you should know) query performance "in action" is not dependent only on the query, but can vary widely depending on the state of the server (previous queries, cached data, etc.). So without extensive metrics and qualifiers to remove the influence of those other factors, all you are presenting is subjective impressions...
Charles Bretana
I'm sorry if you consider experience of an issue to be a detractor. However, I've already posted independent metrics to support my case. I would have thought the combination was compelling, but feel free to rely instead on gut feeling and hyperbole.
skaffman
@skaffman, sorry I raised your hackles, didn't intend that... please wait to get defensive until someone attacks you ... I'm not. I'm just disagreeing with the conclusion you're reaching from the "experience" you relate. Unless you've done exactly what this individual is trying to do, using both options, more than a statistically significant number of times (and I can't say that I have either), then your "experience" is just anecdotal.
Charles Bretana
I would disagree with, or amend, your moral of "never use DELETE for getting rid of large numbers of records" without defining what "large" is. 11 million may not be large if it's only 0.5% of the total data in the table, as we have here. It's impossible to tell without knowing more about the table structure and the conditions under which the removal needs to be performed. See my answer for an elaboration.
Steve Broberg
@Steve, exactly!
Charles Bretana
+1  A: 

Whether it is faster to delete the existing rows or to create a proper new table and drop the old one depends on a lot of factors. 11 million rows is a lot, but it's only 0.5% of the total number of rows in the table. It's quite possible that the recreate & drop could be much slower than the delete, depending on how many indexes exist on the source table, as well as where the rows that need deleting exist on the data pages.

Then there's the issue of whether the source table is live or not. If there are inserts & updates going on while this cleanup is occurring, the copy & drop isn't going to work without a fair amount of extra code to sync the table up after the fact.

Finally, why is it necessary that this operation be "fast"? Is it because the system needs to be offline while the process is occurring? You could write a procedure that removes the dupes while the system is live, but doesn't impact the rest of the system in terms of consuming undo. We have solved this problem in the past by first writing a query that collects the primary keys of the rows to be removed in a second table, like so:

  INSERT
    INTO RowsToDeleteTable
  SELECT PKColumn
    FROM SourceTable
   WHERE <conditions used to find rows to remove>;

CREATE UNIQUE INDEX PK_RowsToDelete ON RowsToDeleteTable (PKColumn);

Then we have a PL/SQL block that either loops over the rows in a cursor like so:

BEGIN
  FOR theRow IN (SELECT PKColumn FROM RowsToDeleteTable ORDER BY 1) LOOP
    <delete source table row for theRow.PKColumn>
    <optionally wait a bit>
    COMMIT;
  END LOOP;
END;

or does something like this:

DECLARE
  minPK SourceTable.PKColumn%TYPE;
BEGIN
  LOOP
    SELECT MIN(PKColumn) INTO minPK FROM RowsToDeleteTable;
    EXIT WHEN minPK IS NULL;   -- stop once every flagged row has been processed
    <delete source table row for minPK>
    <optionally wait a bit>
    DELETE RowsToDeleteTable
     WHERE PKColumn = minPK;
    COMMIT;
  END LOOP;
END;

The looping and "SELECT MIN" approach is obviously less efficient, but it has the advantage of allowing you to follow the progress of the delete operation. We put a bit of wait code in the loop to allow us to control how vigorously the reaping operation occurs.
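
The "<optionally wait a bit>" step can be as simple as a short sleep; for example (assuming you have execute privileges on the DBMS_LOCK package):

    DBMS_LOCK.SLEEP(0.5);  -- pause half a second per batch to throttle the reaper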

The initial creation of the RowsToDeleteTable goes very quickly, and you have the advantage of allowing the process to take as long as you want. In a case like this, the "holes" left in your extents by the deletes won't be too bad, since you're deleting such a small percentage of the total data.

Steve Broberg
A: 

Thanks guys! I had to copy the unique records into a table, truncate the original table and copy back the unique data.
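
A sketch of that sequence (tmp_unique and mytable are placeholder names, and it assumes the duplicates are exact row-for-row copies):

    CREATE TABLE tmp_unique NOLOGGING AS
    SELECT DISTINCT * FROM mytable;

    TRUNCATE TABLE mytable;

    INSERT /*+ APPEND */ INTO mytable
    SELECT * FROM tmp_unique;
    COMMIT;

    DROP TABLE tmp_unique;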

Chattz