I'm new to database development so hopefully this is trivial.

I have a large collection of raw data (around 300 million rows), about 10% of which is duplicated. I need to get the data into a database, and for the sake of performance I'm trying to use SQL copy. The problem is that when I commit the data, primary key violations prevent any of it from being loaded. Can I change the behavior of the primary key so that conflicting rows are simply ignored, or replaced? I don't really care which; I just need one unique copy of each row.

Thanks a lot.

+2  A: 

I think your best bet would be to drop the constraint, load the data, then clean it up and reapply the constraint.
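A rough sketch of that sequence, assuming an Oracle-style database and a hypothetical table MY_DATA with key column ID (names and constraint syntax are placeholders, not from the question):

    -- 1. Drop the constraint so the bulk load cannot fail on duplicates.
    ALTER TABLE my_data DROP CONSTRAINT my_data_pk;

    -- 2. Bulk-load all of the raw rows here (SQL*Loader, COPY, bcp, etc.).

    -- 3. Delete the duplicates, keeping one arbitrary copy per key
    --    (Oracle-style rowid shown; other databases need a different trick).
    DELETE FROM my_data
    WHERE  rowid NOT IN (SELECT MIN(rowid)
                         FROM   my_data
                         GROUP  BY id);

    -- 4. Reapply the constraint.
    ALTER TABLE my_data ADD CONSTRAINT my_data_pk PRIMARY KEY (id);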

rjrapson
A: 

That's what I was considering doing, but I was worried about the performance of getting rid of 30 million randomly placed rows in a 300-million-row database. The duplicate data also has a spatial relationship, which is why I wanted to fix the problem while loading the data rather than after it is all loaded.

A: 

Use a select statement to select exactly the data you want to insert, without the duplicates.

Use that as the basis of a CREATE TABLE XYZ AS SELECT * FROM (query-just-non-dupes).

You might check out AskTom for ideas on how to select the non-duplicate rows.
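A minimal sketch of that approach, assuming an Oracle-style database and a hypothetical table RAW_DATA already loaded (duplicates and all) with key column ID; if the duplicate rows are identical in every column, a plain SELECT DISTINCT * would also do:

    -- Build a clean copy containing exactly one row per key.
    CREATE TABLE clean_data AS
    SELECT *
    FROM   raw_data r
    WHERE  r.rowid = (SELECT MIN(r2.rowid)
                      FROM   raw_data r2
                      WHERE  r2.id = r.id);

    -- Now the primary key can be added without violations.
    ALTER TABLE clean_data ADD CONSTRAINT clean_data_pk PRIMARY KEY (id);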

EvilTeach