I have some large tables (millions of rows). I constantly receive files containing new rows to add to those tables - up to 50 million rows per day. Around 0.1% of the rows I receive are duplicates of rows I have already loaded (or are duplicates within the files themselves). I would like to prevent those rows from being loaded into the table.

I currently use SQL*Loader in order to have sufficient performance to cope with my large data volume. If I take the obvious step and add a unique index on the columns which govern whether or not a row is a duplicate, SQL*Loader will start to fail the entire file which contains the duplicate row - whereas I only want to prevent the duplicate row itself from being loaded.

I know that in SQL Server and Sybase I can create a unique index with the 'Ignore Duplicates' property and that if I then use BCP the duplicate rows (as defined by that index) will simply not be loaded.

Is there some way to achieve the same effect in Oracle?

I do not want to remove the duplicate rows once they have been loaded - it's important to me that they should never be loaded in the first place.

+2  A: 

Use the Oracle MERGE statement. Some explanation here.

Cătălin Pitiș
+2  A: 

You didn't say which release of Oracle you have. Have a look here for the MERGE command.

Basically, it works like this:

-- Merge all rows from the staging table temp_emp_rec into the target
MERGE INTO hr.employees e
USING temp_emp_rec t
ON (e.emp_id = t.emp_id)
WHEN MATCHED THEN
    -- Row already exists: update it
    UPDATE SET first_name = t.first_name,
               last_name  = t.last_name
WHEN NOT MATCHED THEN
    -- New row: insert it
    INSERT (emp_id, first_name, last_name)
    VALUES (t.emp_id, t.first_name, t.last_name);
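
For this question's case, where duplicates should simply be skipped rather than updated, the WHEN MATCHED branch can be omitted entirely (Oracle 10g and later allow a single-branch MERGE), turning it into an "insert if not present". A minimal sketch, using the same tables as above:

-- Rows whose key already exists match the ON clause and, with no
-- WHEN MATCHED branch, are simply ignored; only new keys are inserted
MERGE INTO hr.employees e
USING temp_emp_rec t
ON (e.emp_id = t.emp_id)
WHEN NOT MATCHED THEN
    INSERT (emp_id, first_name, last_name)
    VALUES (t.emp_id, t.first_name, t.last_name);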
Guru
+5  A: 

What do you mean by "duplicate"? If you have a column (or set of columns) which defines a unique row, you should set up a unique constraint against it. In Oracle, adding the constraint automatically creates a unique index to enforce it.
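
For example (the table and column names here are hypothetical):

-- Declare the de-duplication key; Oracle builds a unique index
-- behind the scenes to enforce the constraint
ALTER TABLE big_table
    ADD CONSTRAINT big_table_uk UNIQUE (source_id, record_date);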

EDIT: Yes, as commented below, you should set up a "bad" file for SQL*Loader to capture invalid rows. But I think that establishing the unique index is probably a good idea from a data-integrity standpoint.
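
A rough sketch of such a load (file names and the error cap are placeholders, and this assumes a conventional-path load; a direct-path load handles unique violations differently, loading the duplicates and leaving the index in an UNUSABLE state):

-- load_rows.ctl: rows that violate the unique constraint are
-- rejected into the bad file instead of aborting the whole load
LOAD DATA
INFILE 'new_rows.dat'
APPEND
INTO TABLE big_table
FIELDS TERMINATED BY ','
(source_id, record_date, payload)

invoked with the ERRORS cap raised well above its default of 50:

sqlldr userid=scott/tiger control=load_rows.ctl bad=load_rows.bad errors=100000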

Adam Hawkes
A very good point - I should have mentioned though that I am loading up to 50 million rows per day and therefore want to use SQL*Loader to carry out the data load. I believe that SQL*Loader will fail the entire file if it contains duplicates which violate a unique index.
You can tell SQL*Loader what to do with rejected rows. Try specifying a 'badfile' parameter on the command line, with a suitably high 'errors' parameter.
Hobo
@Adam - sorry, that was directed at ginsoakedboy, not you. I reckon a combination of a unique index and suitable SQL*Loader parameters is the way to go.
Hobo
Thanks - I will try that out
+1  A: 

I would use integrity constraints defined on the appropriate table columns.

This page from the Oracle Concepts manual gives an overview; if you scroll down, you will see the types of constraints that are available.

carpenteri
A good approach to be sure, but in order to meet my performance needs (50 million rows/day) I am using SQL*Loader to load the rows. I think that if I add such an index, SQL*Loader will fail the entire file whenever it contains duplicates, which isn't acceptable for my application.