ansaurus

Question

Algorithm for updating a list from a list

Answer 1

+1 A:

Is there no way to maintain a "last modified" field? It sounds like what you're really looking for is an incremental backup: compare the time the last backup ran against the time each object was last changed, deleted, or added.

John Pirie 2009-06-19 18:25:21

or modified field would be great too!

DoxaLogos 2009-06-19 18:26:46

It would, but sadly I don't have the ability to change the CSV data source...

Dan 2009-06-19 23:09:08

Answer 2

A:

When you pull the list into your program, iterate over it and, for each object, query the database table on a column that maps to a property of the object, such as ObjectName. Alternatively, load the whole table into a list and compare the two lists. I'm assuming the object has something unique about it besides the ID the database assigns.

If the object is not found in the table, create a new entry. If it is found, then, as FogleBird mentioned, keep a computed hash or CRC for that object in the table and compare it with the hash computed from the object in the list. If the hashes don't match, update the table row with the object from the list.
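A minimal sketch of the lookup-then-compare-hash idea, using an in-memory dict as a stand-in for the database table. The names `record_hash` and `sync`, the `name` key field, and the choice of MD5 are assumptions for illustration only; nothing in the question fixes them.

```python
import hashlib

def record_hash(record):
    """Hash a record's fields in a stable (sorted-key) order."""
    payload = "|".join("%s=%s" % (k, record[k]) for k in sorted(record))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def sync(table, incoming, key="name"):
    """Apply incoming records to table, a dict of key -> (hash, record).

    Returns (inserted_keys, updated_keys).
    """
    inserted, updated = [], []
    for record in incoming:
        k = record[key]
        h = record_hash(record)
        if k not in table:
            # Not found: create a new entry.
            table[k] = (h, record)
            inserted.append(k)
        elif table[k][0] != h:
            # Found, but the stored hash differs: update the entry.
            table[k] = (h, record)
            updated.append(k)
    return inserted, updated

table = {}
sync(table, [{"name": "a", "size": 1}, {"name": "b", "size": 2}])
inserted, updated = sync(table, [{"name": "a", "size": 1},
                                 {"name": "b", "size": 3},
                                 {"name": "c", "size": 4}])
print(inserted, updated)  # ['c'] ['b']
```

Against a real database the dict lookup would become a query on the unique column, with the hash stored in an extra column.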

DoxaLogos 2009-06-19 18:26:18

Answer 3

+1 A:

You need timestamps in both your database and your CSV file. Each timestamp should show the date when the record was last updated; compare the timestamps of records with the same ID to decide whether a record needs updating.

As for your idea about intersection: it should be done the other way round. Import all the data from the CSV into a temporary table and do the intersection between the two SQL tables. If you use Oracle or MS SQL 2008 (not sure about 2005), you will find the very useful MERGE keyword, which lets you write the merge in SQL with less effort than you would spend merging the data in another programming language.
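MERGE syntax differs between databases, so here is only a rough sketch of the staging-table idea using Python's `sqlite3` module. SQLite has no MERGE, so the sketch substitutes a plain UPDATE followed by an INSERT. The table names `master` and `staging` and the columns `id` and `value` are hypothetical, and the CSV rows are hard-coded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO master VALUES (?, ?)", [(1, "old"), (2, "same")])

# Rows parsed from the CSV (hard-coded here) go into a temporary staging table.
csv_rows = [(1, "new"), (2, "same"), (3, "added")]
conn.execute("CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, value TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", csv_rows)

# Step 1: update master rows whose staged value differs.
conn.execute("""
    UPDATE master
    SET value = (SELECT s.value FROM staging s WHERE s.id = master.id)
    WHERE EXISTS (SELECT 1 FROM staging s
                  WHERE s.id = master.id AND s.value <> master.value)
""")

# Step 2: insert staged rows that master does not have yet.
conn.execute("""
    INSERT INTO master (id, value)
    SELECT s.id, s.value FROM staging s
    WHERE NOT EXISTS (SELECT 1 FROM master m WHERE m.id = s.id)
""")
conn.commit()

rows = conn.execute("SELECT id, value FROM master ORDER BY id").fetchall()
print(rows)
```

On Oracle or SQL Server the two steps could be collapsed into a single MERGE statement; check your database's documentation for the exact syntax.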

Bogdan_Ch 2009-06-19 18:33:05

Answer 4

+1 A:

The standard approach for huge piles of data amounts to this.

We'll assume that list_1 is the "master" (without duplicates) and list_2 is the "updates" which may have duplicates.

iter_1 = iter(sorted(list_1))  # Essentially SELECT...ORDER BY
iter_2 = iter(sorted(list_2))
eof_1 = False
eof_2 = False
try:
    item_1 = next(iter_1)
except StopIteration:
    eof_1 = True
try:
    item_2 = next(iter_2)
except StopIteration:
    eof_2 = True
while not eof_1 and not eof_2:
    if item_1 == item_2:
        # Do your update to create the new master list.
        try:
            item_2 = next(iter_2)
        except StopIteration:
            eof_2 = True
    elif item_1 < item_2:
        # No (more) updates for item_1; move on to the next master item.
        try:
            item_1 = next(iter_1)
        except StopIteration:
            eof_1 = True
    elif item_2 < item_1:
        # Do your insert to create the new master list.
        try:
            item_2 = next(iter_2)
        except StopIteration:
            eof_2 = True
assert eof_1 or eof_2
if eof_2:
    pass  # No updates left; nothing more to do.
elif eof_1:
    pass  # item_2 and the rest of list_2 are inserts.
else:
    raise AssertionError("What!?!?")

Yes, it involves a potential sort. If list_1 is kept in sorted order when you write it back to the file system, that saves considerable time. If list_2 can be accumulated in a structure that keeps it sorted, then that saves considerable time.

Sorry about the wordiness, but you need to know which iterator raised the StopIteration, so you can't (trivially) wrap the whole while loop in a big-old-try block.
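One way to trim the try/except boilerplate is Python's built-in `next(iterator, default)`, which returns a default value instead of raising StopIteration, so each iterator reports its own exhaustion. The following is a sketch under the same assumptions as the listing above (list_1 is the master without duplicates, list_2 may contain duplicates). The function name `classify_updates`, the `_DONE` sentinel, and the decision to return two lists instead of building the new master in place are my own choices for illustration.

```python
_DONE = object()  # sentinel: marks an exhausted iterator

def classify_updates(list_1, list_2):
    """Split list_2 into (updates, inserts) relative to master list_1."""
    iter_1 = iter(sorted(list_1))
    iter_2 = iter(sorted(list_2))
    item_1 = next(iter_1, _DONE)
    item_2 = next(iter_2, _DONE)
    updates, inserts = [], []
    while item_1 is not _DONE and item_2 is not _DONE:
        if item_1 == item_2:
            # Present in the master: this is an update.
            updates.append(item_2)
            item_2 = next(iter_2, _DONE)
        elif item_1 < item_2:
            # No (more) updates for item_1; advance the master.
            item_1 = next(iter_1, _DONE)
        else:
            # item_2 sorts before item_1, so the master lacks it: insert.
            inserts.append(item_2)
            item_2 = next(iter_2, _DONE)
    # If the master ran out first, whatever is left of list_2 is an insert.
    while item_2 is not _DONE:
        inserts.append(item_2)
        item_2 = next(iter_2, _DONE)
    return updates, inserts

updates, inserts = classify_updates([1, 3, 5], [0, 3, 3, 4, 6, 7])
print(updates, inserts)  # [3, 3] [0, 4, 6, 7]
```

Real records would be compared on a key field and not as whole items; the plain integers here just keep the example short.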

S.Lott 2009-06-19 19:11:13
