I have a DB table with 2.5 billion records. There are duplicates to the tune of 11 million. What is the fastest way to delete these 11 million records?
First, put an index on the column or columns that define and contain the duplicate values. Then, assuming the table has a primary key (PK):
Delete From TableName T
Where PK <> (Select Min(PK) From TableName
             Where ColA = T.ColA
             And ColB = T.ColB
             -- ...one predicate for each column in the set defined above
            );
NOTE: you could also use Max(PK); all you're doing is identifying a single record to keep from each set of duplicates.
EDIT: To eliminate the extensive use of the transaction log and the UNDO tablespace, you could store the values that are duplicated in a temp table, and then delete the dupes for each value within its own transaction...
Assuming only one column (call it ColA, a number) defines the dupes...
Create Table Dupes (ColA Number);

Insert Into Dupes (ColA)
Select ColA
From TableName
Group By ColA
Having Count(*) > 1;
Declare
    recordExists Number := 0;
    ColAValue    Number;
Begin
    Select Case When Exists (Select Null From Dupes)
                Then 1 Else 0 End
    Into recordExists
    From Dual;

    While recordExists = 1
    Loop
        Select Max(ColA) Into ColAValue From Dupes;

        Delete From TableName T
        Where ColA = ColAValue
        And PK <> (Select Min(PK) From TableName
                   Where ColA = ColAValue);

        Delete From Dupes Where ColA = ColAValue;

        Commit;

        Select Case When Exists (Select Null From Dupes)
                Then 1 Else 0 End
        Into recordExists
        From Dual;
    End Loop;
End;
Not tested, so the syntax may need massaging...
If you're sure that you won't compromise the integrity of the data (referential integrity), disable the constraints and indexes, perform the delete, then re-enable them. You have to try it first, to see whether rebuilding the indexes on re-enable is less time-consuming than the delete with them enabled.
Some query optimization might also help, but without knowing more details, this is only theoretical.
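A rough sketch of that sequence (the constraint and index names here are hypothetical, not from the question; adjust them to your schema, and verify that deleting against an unusable index is acceptable in your version):

-- fk_child_mytable: a foreign key on a child table referencing mytable (hypothetical)
-- ix_mytable_dup:   a non-unique index on the duplicate-defining column (hypothetical)
ALTER TABLE child_table DISABLE CONSTRAINT fk_child_mytable;
ALTER INDEX ix_mytable_dup UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;

-- ...perform the delete here...

ALTER INDEX ix_mytable_dup REBUILD NOLOGGING;
ALTER TABLE child_table ENABLE CONSTRAINT fk_child_mytable;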
DELETE
FROM    mytable
WHERE   rowid IN
        (
        SELECT  rid
        FROM    (
                SELECT  rowid rid,
                        ROW_NUMBER() OVER (PARTITION BY dupfield ORDER BY dupfield) rn
                FROM    mytable r
                )
        WHERE   rn > 1
        )
or maybe even this:
DELETE
FROM    mytable mo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable mi
        WHERE   mi.dup_field = mo.dup_field
                AND mi.rowid < mo.rowid
        )
Both of these queries will use a quite efficient HASH SEMI JOIN; the latter will be faster if there is no index on dup_field.
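If you want to check the plan before running either statement against a table this size, something along these lines should work (shown here for the EXISTS variant):

EXPLAIN PLAN FOR
DELETE
FROM    mytable mo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable mi
        WHERE   mi.dup_field = mo.dup_field
                AND mi.rowid < mo.rowid
        );

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);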
You may be tempted to copy the rows instead, but note that much more REDO and UNDO information will be generated when copying 2.5 billion rows than when deleting 11 million.
Deleting one duplicate from many is a tricky business, and with that many records, you have a problem.
One option is to turn the problem on its head and copy the records you want to keep into a new table. You can use CREATE TABLE ... NOLOGGING AS SELECT DISTINCT ... syntax, which copies your de-duplicated records while generating minimal redo, which is much faster. Once your new table is populated, drop or rename the old one, and rename the new one into place.
Oh, and remember to slap a UNIQUE index on the new table so this doesn't happen again.
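A minimal sketch of that copy-and-switch, assuming entire rows are duplicated and using a hypothetical staging name (mytable_dedup); remember that grants, constraints, and any other indexes have to be re-created on the new table:

CREATE TABLE mytable_dedup NOLOGGING AS
SELECT DISTINCT *
FROM   mytable;

ALTER TABLE mytable RENAME TO mytable_old;
ALTER TABLE mytable_dedup RENAME TO mytable;

CREATE UNIQUE INDEX ux_mytable_dup ON mytable (dup_field);

DROP TABLE mytable_old;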
The moral of the story is: never use DELETE to get rid of large numbers of records; it's horrifyingly slow because every deleted row has to be written to the undo and redo logs. Either copy-and-switch, or TRUNCATE.
Whether deleting the existing rows or creating a new table and dropping the old one is faster depends on a lot of factors. 11 million rows is a lot, but it's only about 0.5% of the total number of rows in the table. It's quite possible that the recreate-and-drop could be much slower than the delete, depending on how many indexes exist on the source table, as well as where the rows that need deleting sit in the data blocks.
Then there's the issue of whether the source table is live or not. If there are inserts & updates going on while this cleanup is occurring, the copy & drop isn't going to work without a fair amount of extra code to sync the table up after the fact.
Finally, why is it necessary that this operation be "fast"? Is it because the system needs to be offline while the process is occurring? You could write a procedure that removes the dupes while the system is live, but doesn't impact the rest of the system in terms of consuming undo. We have solved this problem in the past by first writing a query that collects the primary keys of the rows to be removed in a second table, like so:
INSERT INTO RowsToDeleteTable
SELECT PKColumn
FROM   SourceTable
WHERE  <conditions used to find rows to remove>;

CREATE UNIQUE INDEX PK_RowsToDelete ON RowsToDeleteTable (PKColumn);
Then we have a PL/SQL block that either loops over the rows in a cursor like so:
BEGIN
    FOR theRow IN (SELECT PKColumn FROM RowsToDeleteTable ORDER BY 1) LOOP
        <delete source table row for theRow.PKColumn>
        <optionally wait a bit>
        COMMIT;
    END LOOP;
END;
or does something like this:
DECLARE
    thePK RowsToDeleteTable.PKColumn%TYPE;
BEGIN
    LOOP
        SELECT MIN(PKColumn) INTO thePK FROM RowsToDeleteTable;
        EXIT WHEN thePK IS NULL;
        <delete source table row for thePK>
        <optionally wait a bit>
        DELETE RowsToDeleteTable
        WHERE  PKColumn = thePK;
        COMMIT;
    END LOOP;
END;
The looping and the "SELECT MIN" are obviously less efficient, but they have the advantage of allowing you to follow the progress of the delete operation. We put a bit of wait code in the loop to allow us to control how vigorously the reaping operation occurs.
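For what it's worth, the "wait a bit" step can be as simple as a sleep call inside the loop, assuming you have EXECUTE privilege on DBMS_LOCK:

-- after each COMMIT inside the loop
DBMS_LOCK.SLEEP(0.5);  -- throttle the reaping by pausing half a second per batch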
The initial creation of the RowsToDeleteTable goes very quickly, and you have the advantage of allowing the process to take as long as you want. In a case like this, the "holes" left in your extents by the deletes won't be too bad, since you're deleting such a small percentage of the total data.
Thanks guys! I had to copy the unique records into a table, truncate the original table and copy back the unique data.
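For the record, a rough sketch of that final approach, with a hypothetical staging table name (mytable_keep) and assuming whole rows are duplicated:

CREATE TABLE mytable_keep NOLOGGING AS
SELECT DISTINCT *
FROM   mytable;

TRUNCATE TABLE mytable;

INSERT /*+ APPEND */ INTO mytable
SELECT * FROM mytable_keep;
COMMIT;

DROP TABLE mytable_keep;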