views:

144

answers:

4

I've got a data source that provides a list of objects and their properties (a CSV file, but that doesn't matter). Each time my program runs, it needs to pull a new copy of the list of objects, compare it to the list of objects (and their properties) stored in the database, and update the database as needed.

Dealing with new objects is easy - the data source gives each object a sequential ID number, check the top ID number in the new information against the database, and you're done. I'm looking for suggestions for the other cases - when some of an object's properties have changed, or when an object has been deleted.

A naive solution would be to pull all the objects from the database and get the complement of the intersection of the two sets (old and new) and then examine those results, but that seems like it wouldn't be very efficient if the sets get large. Any ideas?

+1  A: 

Is there no way to maintain a "last time modified" field? That's what it sounds like you're really looking for: an incremental backup, based on last time backup was run, compared to last time an object was changed/deleted(/added).

John Pirie
or modified field would be great too!
DoxaLogos
it would, but I don't the ability to change the CSV data source sadly...
Dan
A: 

When you pull the list into your program, iterate over the list doing a query based on a column property in the database table that maps to the same property of the object from list like ObjectName. Or you could load the whole table into a list and compare the list that way. I assuming that you have something unique about the object that exists besides the ID the database assigns.

If that object is not found in the table via the query, create a new entry. If it is found like FogleBird mentioned, have a computed hash or CRC stored for that object in the table that you can compare with the object in the list(run computation on the object). If the hashes don't match, update that object with the one on the list.

DoxaLogos
+1  A: 

You need to have timestamps in both your database and your CSV file. Timestamp should show the data when the record was updated and you should compare timestamps of the record with same IDs to decide if you need updating it or not

As to your idea about intersection... It should be done vise versa! You have to import all data from CSV to the temporary table and do intersection between 2 SQL database tables. If you use Oracle or MS SQL 2008 (not sure for 2005) you will found a very usefull MERGE keyword, so you can write SQL with less efforts then you will spend for merging data in other programming language.

Bogdan_Ch
+1  A: 

The standard approach for huge piles of data amounts to this.

We'll assume that list_1 is the "master" (without duplicates) and list_2 is the "updates" which may have duplicates.

iter_1 = iter( sorted(list_1) ) # Essentially SELECT...ORDER BY
iter_2 = iter( sorted(list_2) )
eof_1 = False
eof_2 = False
try:
    item_1 = iter_1.next()
except StopIteration:
    eof_1= True
try:
    item_2 = iter_2.next()
except StopIteration:
    eof_2= True
while not eof_1 and not eof_2:
    if item_1 == item_2:
        # do your update to create the new master list.
        try:
            item_2 = iter_2.next()
        except StopIteration:
            eof_2= True
    elif item_1 < item_2:
        try:
            item_1 = iter_1.next()
        except StopIteration:
            eof_1= True
    elif item_2 < item_1:
        # Do your insert to create the new master list.
        try:
            item_2 = iter_2.next()
        except StopIteration:
            eof_2= True
assert eof_1 or eof_2
if eof_1:
    # item_2 and the rest of list_2 are inserts.
elif eof_2:
    pass
else:
    raise Error("What!?!?")

Yes, it involves a potential sort. If list_1 is kept in sorted order when you write it back to the file system, that saves considerable time. If list_2 can be accumulated in a structure that keeps it sorted, then that saves considerable time.

Sorry about the wordiness, but you need to know which iterator raised the StopIteration, so you can't (trivially) wrap the whole while loop in a big-old-try block.

S.Lott