I have a process that consolidates 40+ identically structured databases down to one consolidated database, the only difference being that the consolidated database adds a project_id field to each table.

To be as efficient as possible, I'm trying to copy or update a record from the source databases to the consolidated database only if it has been added or changed. I delete outdated records from the consolidated database, and then copy in any records that don't exist there yet. To delete outdated or changed records, I'm using a query similar to this:

DELETE FROM <table> 
 WHERE NOT EXISTS (SELECT <primary keys> 
                     FROM <source> b 
                    WHERE ((<b.fields = a.fields>) or 
                          (b.fields is null and a.fields is null))) 
  AND PROJECT_ID = <project_id>

This works for the most part, but one of the tables in the source database has over 700,000 records, and this query takes over an hour to complete.

How can I make this query more efficient?

+2  A: 

Use timestamps or, better yet, audit tables to identify the records that changed since time "X", where "X" is the time the last sync started (save it at the start of each run). We use that for interface feeds.

DVK
Are these things that you've added to the database/tables, i.e. a last_updated field on each table and an audit_table in each database? Unfortunately, I can't modify the schema of the source databases as they come from a vendor product.
aasukisuki
Can you add an audit table?
DVK
Technically I could add an audit table or even a field to each table to be used as a timestamp, but the vendor process will never add anything to the audit table, or populate the timestamp field on change.
aasukisuki
Does the table have a non-decreasing field (a DB-generated or naturally occurring ID)? If not, it's possible to populate an audit table from a periodically running SP, but it's devilishly yucky.
DVK
Also, see if you can somehow plug into the vendor process "unofficially". E.g. maybe there's some leftover sludge from it in some file system or DB listing rows that were just (or since 5 mins ago) acted on.
DVK
A: 

You might want to try LEFT JOIN with NULL filter:

DELETE      t
FROM        <table> t
LEFT JOIN   <source> b
        ON (t.Field1 = b.Field1 OR (t.Field1 IS NULL AND b.Field1 IS NULL))
        AND (t.Field2 = b.Field2 OR (t.Field2 IS NULL AND b.Field2 IS NULL))
        -- ...
WHERE       t.PROJECT_ID = <project_id>
        AND b.PrimaryKey IS NULL -- any of the PK fields will do, but I really hope you do not use composite PKs
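For anyone who wants to try the anti-join pattern end to end, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (consolidated, source, id, field1, field2) are made up for the demo. SQLite has no DELETE ... JOIN syntax, so the LEFT JOIN is wrapped in a rowid subquery, and SQLite's null-safe IS operator stands in for the "a = b OR (a IS NULL AND b IS NULL)" pairs. Treat it as an illustration of the idea, not as the syntax your DBMS needs.

```python
import sqlite3


def delete_stale_rows(conn, project_id):
    """Delete one project's consolidated rows that have no identical source row.

    All table/column names are hypothetical. In SQLite, "x IS y" is a
    null-safe equality: NULL IS NULL is true, and 'a' IS 'a' is true.
    """
    cur = conn.execute(
        """
        DELETE FROM consolidated
        WHERE rowid IN (
            SELECT t.rowid
            FROM consolidated AS t
            LEFT JOIN source AS b
                   ON t.id = b.id
                  AND t.field1 IS b.field1
                  AND t.field2 IS b.field2
            WHERE t.project_id = ?
              AND b.id IS NULL      -- no matching source row was found
        )
        """,
        (project_id,),
    )
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE source (id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT);
        CREATE TABLE consolidated (
            project_id INTEGER, id INTEGER, field1 TEXT, field2 TEXT,
            PRIMARY KEY (project_id, id)
        );
        INSERT INTO source VALUES (1, 'a', NULL), (2, 'b', 'x');
        INSERT INTO consolidated VALUES
            (7, 1, 'a', NULL),   -- unchanged: NULL matches NULL
            (7, 2, 'b', 'old'),  -- changed in the source
            (7, 3, 'c', 'y'),    -- no longer in the source
            (8, 2, 'b', 'old');  -- other project: must be left alone
        """
    )
    deleted = delete_stale_rows(conn, 7)
    remaining = conn.execute(
        "SELECT project_id, id FROM consolidated ORDER BY project_id, id"
    ).fetchall()
    return deleted, remaining
```

In this demo only the changed row and the orphaned row of project 7 are removed; the unchanged row (including its NULL column) and the other project's row survive.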

But if you are comparing all non-PK columns, then your query is going to suffer.

In that case it is better to add an UpdatedAt TIMESTAMP field (as DVK suggests) to both databases and keep it current with an AFTER UPDATE trigger. Your sync procedure will then be much faster, provided you create an index that includes the PKs and the UpdatedAt column.
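A rough sketch of that idea, again with Python's sqlite3 module so it can be run as-is. The items table, its columns, the index name and the trigger name are all assumptions for the demo, and trigger syntax differs between database products, so adapt it rather than copy it.

```python
import sqlite3

SCHEMA = """
CREATE TABLE items (
    id         INTEGER PRIMARY KEY,
    name       TEXT,
    updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Index on the timestamp plus the key, so the sync query avoids a full scan.
CREATE INDEX items_updated_at ON items (updated_at, id);

-- Stamp the row whenever its data changes. SQLite leaves recursive
-- triggers off by default, so the inner UPDATE does not re-fire this.
CREATE TRIGGER items_touch AFTER UPDATE OF name ON items
BEGIN
    UPDATE items SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.id;
END;
"""


def changed_since(conn, last_sync):
    """Return the ids of rows stamped after the given 'YYYY-MM-DD HH:MM:SS' time."""
    rows = conn.execute(
        "SELECT id FROM items WHERE updated_at > ? ORDER BY id", (last_sync,)
    )
    return [row[0] for row in rows]


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Pretend these rows were last touched long before the previous sync.
    conn.executemany(
        "INSERT INTO items (id, name, updated_at) VALUES (?, ?, ?)",
        [(1, "a", "1999-01-01 00:00:00"),
         (2, "b", "1999-01-01 00:00:00"),
         (3, "c", "1999-01-01 00:00:00")],
    )
    last_sync = "2000-01-01 00:00:00"   # time "X" saved by the previous run
    conn.execute("UPDATE items SET name = 'changed' WHERE id = 2")
    return changed_since(conn, last_sync)
```

Only the row touched after the saved sync time comes back, so the copy step can skip everything else. Note that a timestamp column says nothing about rows deleted from the source; those still have to be detected separately.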

van
A: 

You can reorder the WHERE clause; it has four comparisons, so put the one most likely to fail first.

If you can alter the databases/application slightly, and you'll need to do this again, a bit field that says "updated" might not be a bad addition.

Dean J
A: 

I usually rewrite queries like this to avoid the NOT. NOT IN is horrible for performance, although NOT EXISTS improves on it.

Check out this article: http://www.sql-server-pro.com/sql-where-clause-optimization.html

My suggestion...

Select your pkey column into a working/temp table, add an int flag column (default 0, not null), and index the pkey column. Set flag = 1 if the record exists in your subquery (much quicker!). Then replace the subselect in your main query with an exists on (select pkey from temptable where flag=0).

What this works out to is building a list of 'not exists' values from the all-inclusive set, which can then be used in an inclusive (IN / EXISTS) test.

Here's our total set. {1,2,3,4,5}

Here's the existing set {1,3,4}

We create our working table from these two sets (technically a left outer join) (record:exists)

{1:1, 2:0, 3:1, 4:1, 5:0}

Our set of 'not existing records'

{2,5} (Select * from temptable where flag=0)

Our product... and much quicker (indexes!)

{1,2,3,4,5} in {2,5} = {2,5}

{1,2,3,4,5} not in {1,3,4} = {2,5}

This can be done without a working table, but its use makes visualizing what's happening easier.
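The same walk-through can be sketched as runnable code using Python's sqlite3 module. The total_set and existing_set tables are hypothetical stand-ins for the consolidated and source tables; temptable, pkey and flag follow the names used above.

```python
import sqlite3


def missing_keys(total, existing):
    """Build the flagged working table and return (flags, keys with flag = 0)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE total_set    (pkey INTEGER PRIMARY KEY);
        CREATE TABLE existing_set (pkey INTEGER PRIMARY KEY);
        -- The primary key doubles as the index on pkey.
        CREATE TABLE temptable (
            pkey INTEGER PRIMARY KEY,
            flag INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    conn.executemany("INSERT INTO total_set VALUES (?)", [(k,) for k in total])
    conn.executemany("INSERT INTO existing_set VALUES (?)", [(k,) for k in existing])

    # Step 1: copy every key into the working table; flag starts at 0.
    conn.execute("INSERT INTO temptable (pkey) SELECT pkey FROM total_set")
    # Step 2: mark the keys that do exist in the other set.
    conn.execute(
        "UPDATE temptable SET flag = 1 "
        "WHERE pkey IN (SELECT pkey FROM existing_set)"
    )
    flags = dict(conn.execute("SELECT pkey, flag FROM temptable"))

    # Step 3: the main query now uses a positive IN test on the unflagged keys
    # instead of a NOT IN / NOT EXISTS test on the existing set.
    missing = [
        row[0]
        for row in conn.execute(
            "SELECT pkey FROM total_set "
            "WHERE pkey IN (SELECT pkey FROM temptable WHERE flag = 0) "
            "ORDER BY pkey"
        )
    ]
    return flags, missing
```

Calling missing_keys([1, 2, 3, 4, 5], [1, 3, 4]) reproduces the sets above: the flags come out as {1: 1, 2: 0, 3: 1, 4: 1, 5: 0} and the missing keys as [2, 5]. Whether this actually beats NOT EXISTS depends on your DBMS and its optimizer, so measure before committing to it.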

Kris

KSimons