ansaurus

Question

Getting record differences between 2 nearly identical tables

Answer 1

+2 A:

Use timestamps or, better yet, audit tables to identify the records that changed since time "X", where "X" is the time the last sync started, saved at the beginning of each run. We use that approach for interface feeds.
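A minimal sketch of that high-water-mark idea, using Python's built-in sqlite3 module so it runs anywhere. The table names (source_rows, sync_state), the last_updated column, and the function name are hypothetical, and the sketch assumes something actually maintains last_updated on every change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_rows (
        id           INTEGER PRIMARY KEY,
        payload      TEXT,
        last_updated TEXT NOT NULL       -- hypothetical timestamp column
    );
    CREATE TABLE sync_state (
        last_sync_started TEXT NOT NULL  -- time "X" saved by the previous run
    );
    INSERT INTO source_rows VALUES
        (1, 'unchanged', '2009-10-05 17:00:00'),
        (2, 'edited',    '2009-10-05 18:30:00'),
        (3, 'new',       '2009-10-05 18:40:00');
    INSERT INTO sync_state VALUES ('2009-10-05 18:00:00');
""")

def changed_since_last_sync(conn, sync_started_at):
    """Return ids changed since the saved time "X", then store the new "X"."""
    (last_x,) = conn.execute(
        "SELECT last_sync_started FROM sync_state"
    ).fetchone()
    # >= rather than > so a row stamped exactly at "X" is re-read, not missed.
    rows = conn.execute(
        "SELECT id FROM source_rows WHERE last_updated >= ? ORDER BY id",
        (last_x,),
    ).fetchall()
    conn.execute(
        "UPDATE sync_state SET last_sync_started = ?", (sync_started_at,)
    )
    return [row[0] for row in rows]

# The start time of this run becomes the next "X".
changed_ids = changed_since_last_sync(conn, "2009-10-05 18:45:00")
```

Saving the time the sync started, rather than the time it finished, means rows modified while the sync is running are picked up by the next run.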

DVK 2009-10-05 18:45:15

Are these things that you've added to the database/tables? I.e., a last_updated field on each table and an audit table in each database? Unfortunately, I can't modify the schema of the source databases, as they come from a vendor product.

aasukisuki 2009-10-05 18:49:16

Can you add an audit table?

DVK 2009-10-05 18:59:07

Technically I could add an audit table or even a field to each table to be used as a timestamp, but the vendor process will never add anything to the audit table, or populate the timestamp field on change.

aasukisuki 2009-10-05 19:24:49

Does the table have a non-decreasing field (a DB-generated or naturally occurring ID)? If not, it's possible to populate an audit table from a periodically running SP, but that is devilishly yucky.

DVK 2009-10-05 22:28:36

Also, see if you can somehow plug into the vendor process "unofficially". E.g., maybe there's some leftover sludge from it in a file system or DB listing the rows that were just acted on (or acted on in the last 5 minutes).

DVK 2009-10-05 22:30:06

Answer 2

A:

You might want to try LEFT JOIN with NULL filter:

DELETE      t
FROM        <table> t
LEFT JOIN   <source> b 
        ON (t.Field1 = b.Field1 OR (t.Field1 IS NULL AND b.Field1 IS NULL))
        AND(t.Field2 = b.Field2 OR (t.Field2 IS NULL AND b.Field2 IS NULL))
        --//...
WHERE       t.PROJECT_ID = <project_id>
        AND b.PrimaryKey IS NULL --// any of the PK fields will do, but I really hope you do not use composite PKs
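
As a runnable illustration of the same LEFT JOIN / IS NULL pattern, here is a small sketch using Python's built-in sqlite3 module. The tables target_rows and source_rows and their columns are hypothetical stand-ins for <table> and <source>. SQLite has no DELETE ... FROM ... JOIN form, so the sketch only selects the rows that would be affected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target_rows (id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT);
    CREATE TABLE source_rows (id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT);
    INSERT INTO target_rows VALUES (1, 'a', NULL), (2, 'b', 'x'), (3, 'c', 'y');
    INSERT INTO source_rows VALUES (1, 'a', NULL), (2, 'b', 'CHANGED');
""")

# Rows in target_rows with no field-for-field identical row in source_rows.
# The "OR (... IS NULL AND ... IS NULL)" part treats two NULLs as equal.
unmatched = [
    row[0]
    for row in conn.execute("""
        SELECT t.id
        FROM target_rows t
        LEFT JOIN source_rows b
               ON (t.field1 = b.field1 OR (t.field1 IS NULL AND b.field1 IS NULL))
              AND (t.field2 = b.field2 OR (t.field2 IS NULL AND b.field2 IS NULL))
        WHERE b.id IS NULL   -- no matching source row was found
        ORDER BY t.id
    """)
]
# unmatched == [2, 3]: row 2 differs in field2, row 3 is missing from the source
```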

But if you are comparing all non-PK columns, then your query is going to suffer.

In this case it is better to add an UpdatedAt TIMESTAMP field (as DVK suggests) on both databases, which you could maintain with an AFTER UPDATE trigger. Your sync procedure would then be much faster, provided that you create an index that includes the PKs and the UpdatedAt column.
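
A rough sketch of the trigger-maintained timestamp, again in SQLite via Python's sqlite3 module. The table items, the trigger and index names, and the placeholder default timestamp are all assumptions made for illustration; trigger syntax differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        id         INTEGER PRIMARY KEY,
        payload    TEXT,
        updated_at TEXT NOT NULL DEFAULT '2000-01-01 00:00:00'
    );
    -- Index covering the timestamp and the PK, so the sync query can seek on it.
    CREATE INDEX ix_items_updated_at ON items (updated_at, id);

    -- Fires only when payload changes, so the trigger's own UPDATE of
    -- updated_at does not fire it again.
    CREATE TRIGGER trg_items_touch
    AFTER UPDATE OF payload ON items
    FOR EACH ROW
    BEGIN
        UPDATE items SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.id;
    END;

    INSERT INTO items (id, payload) VALUES (1, 'old'), (2, 'old');
    UPDATE items SET payload = 'new' WHERE id = 2;
""")

last_sync = "2000-01-01 00:00:00"
changed_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM items WHERE updated_at > ? ORDER BY id", (last_sync,)
    )
]
# Only the row touched by the UPDATE is reported.
```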

van 2009-10-05 19:00:25

Answer 3

A:

You can reorder the WHERE clause: it has four comparisons, so put the one most likely to fail first.

If you can alter the databases/application slightly, and you'll need to do this again, a bit field that says "updated" might not be a bad addition.

Dean J 2009-10-05 19:08:32

Answer 4

A:

I usually rewrite queries like this to avoid the NOT. NOT IN is horrible for performance, although NOT EXISTS improves on this.
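
To make the rewrite concrete, here is a small sketch with Python's sqlite3 module; the tables all_keys and existing_keys are hypothetical. It does not measure performance, which depends on the engine and its optimizer. It only shows the two forms side by side, plus one correctness difference worth knowing: if the subquery can return NULL, NOT IN matches nothing, while NOT EXISTS still behaves as expected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE all_keys      (pkey INTEGER PRIMARY KEY);
    CREATE TABLE existing_keys (pkey INTEGER);   -- nullable on purpose
    INSERT INTO all_keys      VALUES (1), (2), (3), (4), (5);
    INSERT INTO existing_keys VALUES (1), (3), (4), (NULL);
""")

# NOT IN: the NULL in the subquery makes every non-matching comparison
# evaluate to unknown, so no row passes the filter.
not_in = conn.execute("""
    SELECT pkey FROM all_keys
    WHERE pkey NOT IN (SELECT pkey FROM existing_keys)
    ORDER BY pkey
""").fetchall()

# NOT EXISTS: the correlated subquery simply finds no row for 2 and 5.
not_exists = conn.execute("""
    SELECT a.pkey FROM all_keys a
    WHERE NOT EXISTS (SELECT 1 FROM existing_keys e WHERE e.pkey = a.pkey)
    ORDER BY a.pkey
""").fetchall()
# not_in == [] and not_exists == [(2,), (5,)]
```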

Check out this article, http://www.sql-server-pro.com/sql-where-clause-optimization.html

My suggestion...

Select your pkey column into a working/temp table, add a column (flag) int default 0 not null, and index the pkey column. Set flag = 1 if the record exists in your subquery (much quicker!). Then replace the subselect in your main query with an EXISTS over (select pkey from temptable where flag = 0).

What this works out to is a list of 'not exists' values, built from the all-inclusive set, that can be used in an inclusive test (IN / EXISTS) instead of an exclusive one (NOT IN).

Here's our total set. {1,2,3,4,5}

Here's the existing set {1,3,4}

We create our working table from these two sets (technically a left outer join) (record:exists)

{1:1, 2:0, 3:1, 4:1, 5:0}

Our set of 'not existing records'

{2,5} (Select * from temptable where flag=0)

Our product... and much quicker (indexes!)

{1,2,3,4,5} in {2,5} = {2,5}

{1,2,3,4,5} not in {1,3,4} = {2,5}

This can be done without a working table, but its use makes visualizing what's happening easier.
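
The walk-through above can be sketched end to end with Python's sqlite3 module. The table names total_set, existing_set, and working are hypothetical, and the numbers are the ones from the example sets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE total_set    (pkey INTEGER PRIMARY KEY);
    CREATE TABLE existing_set (pkey INTEGER PRIMARY KEY);
    INSERT INTO total_set    VALUES (1), (2), (3), (4), (5);
    INSERT INTO existing_set VALUES (1), (3), (4);

    -- Working table: every key from the total set, flag defaults to 0.
    CREATE TEMP TABLE working (
        pkey INTEGER PRIMARY KEY,        -- the primary key doubles as the index
        flag INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO working (pkey) SELECT pkey FROM total_set;

    -- Mark the keys that exist in the other set.
    UPDATE working SET flag = 1
    WHERE pkey IN (SELECT pkey FROM existing_set);
""")

# {1:1, 2:0, 3:1, 4:1, 5:0}
working = conn.execute(
    "SELECT pkey, flag FROM working ORDER BY pkey"
).fetchall()

# Main query: an inclusive EXISTS against the flag = 0 rows replaces NOT IN.
missing = [
    row[0]
    for row in conn.execute("""
        SELECT t.pkey FROM total_set t
        WHERE EXISTS (
            SELECT 1 FROM working w WHERE w.pkey = t.pkey AND w.flag = 0
        )
        ORDER BY t.pkey
    """)
]
# missing == [2, 5]
```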

Kris

KSimons 2009-10-05 19:27:20
