views:

11

answers:

1

Hi,

I currently have a table of 3m records that needs updating nightly. The data that populates this table comes from ~100 APIs that all get normalised into one jumbo table.

Problem: How to reflect new records being added, and records being deleted at the source?

Facts: I can't truncate the table every night and reinsert. Each API provides a constant ID for each record (so I can keep track of what's what). Some fields will be updated each night.

Solutions: New records are easy, I just add them to my table with an AvailableFrom date. Updates are also easy, for each record I check if it exists and if data has changed (performance will suck).

Deleted records are where I'm stuck. The APIs just dump me a load of data, how do I tell if a record has "dropped off"?

I'm thinking a swap table of some sort - any ideas?

A: 

If the only way to tell whether a record has been deleted is to check whether the api delivers it any more without knowing what record exactly you are looking for you will need to keep track on the iports. If you always do a full import:

Solution 1:

set a flag for for every row in database, then do the import and update the flag for every row you get, then delete everything that hasn't been updated.

Solution 2: Set an import ID (bound to the date?) for every import and write it to the database entries. so you know which row originates from which import. Overriding existing data with the import id from the latest import. Then you can work only with the data from the last import.

but if you always do a full import, dropping everything before should be faster shouldnt it?

Joe Hopfgartner
Thanks, I went for the 1st solution. Solution 2 is probably neater (I could create Import objects with stats etc) but instead I added a LastUpdated column. A separate process then purges old rows. Full import would destroy the table IDs that are referenced elsewhere.
jimi