Hello.

I have a database (in Access 95 format... don't ask) that contains data I'd like to use in Django. I can already connect, issue queries, etc. using pyodbc. For the most part this is a one-way sync operation: the data won't be modified in Django, but it will be modified in the Access 95 DB via legacy applications. That's a minor lie: the data will be modified, but only new Django-only fields will be touched. My ideal would be to use Django's model inheritance to extend the existing Access 95 data, and never write attributes to the 'parent' tables.
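For reference, the read side is plain pyodbc against the Access ODBC driver. A minimal sketch of what "connect, issue queries" looks like (the file path is a placeholder; the driver string is the standard one for .mdb files):

    import pyodbc

    # Hypothetical path to one of the source databases.
    conn = pyodbc.connect(r"Driver={Microsoft Access Driver (*.mdb)};DBQ=C:\data\Database1.mdb")
    cursor = conn.cursor()
    cursor.execute("SELECT ID, Name, Description FROM OptionLabels")
    for row in cursor.fetchall():
        print row.ID, row.Name  # pyodbc rows allow attribute access by column name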

I can think of a few ways to use this data in Django (with Django's ORM):

  1. Implement a database backend for Django's ORM. While this is the "ideal solution", it seems rather hairy for simple read-only data. It also seems like one of the least backwards-compatible options when upgrading Django; I'm not sure the framework's API here is stable.

  2. Write some batch code that loops over the rows in the Access DB and populates a new Django DB and model. This stores all the data twice, yes, but for ORM operations it should be just as if the syncing were never a factor. It's okay for the data to be delayed by up to a day; real-time syncing is not required.

First, I'll describe what I've done already. I implemented method 2 as a new model Manager, which is passed some attributes that more or less map Django field names to the MDB field names. It also supports some 'coercing', e.g. replacing NULL source values with a default (like the empty string).

So the situation actually gets a bit worse here. It's not just a single DB I'm syncing with; it is potentially many (in my case, currently two). These databases have identical schemas, and the setup is similar to the modern concept of sharding, with the important glaring omission that primary keys are duplicated across the databases. Perhaps I should explain:

Database1.mdb holds the tables House, Room, and Shelf. House has an id and a name, Room has an id, a name, and a pointer to a House, and Shelf has an id, a name, and a pointer to a Room.

Database2.mdb holds the same tables, with the same schema as Database1.mdb.

So, the end user works with Database1.mdb for a good amount of time. By then he's created several Houses, Rooms, and Shelves: let's say 10 shelves per room, 10 rooms per house, and 10 houses. Later on, he also creates 10 Houses, 10*10 Rooms, and 10*10*10 Shelves in Database2.mdb.

Now my Django application is interested in all Shelves (yes, all 2*10*10*10 of them). One would normally treat this as a composite primary key, but Django's support for composite primary keys is nonexistent. So if I'm going to use method 2 (a nightly sync to local tables), I'll need a local id column and a way to map the old composite key (database name + id number) to the new id key; a sketch of what I mean follows.
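To make that concrete, the local emulation of the composite key looks roughly like this (hypothetical model; unique_together stands in for the composite primary key Django doesn't support):

    from django.db import models

    class Shelf(models.Model):
        # Django's automatic 'id' becomes the new local primary key.
        library = models.ForeignKey('Library')  # which source .mdb the row came from
        mdb_id = models.IntegerField()          # the original Access primary key
        name = models.CharField(max_length=255)

        class Meta:
            # Emulate the old composite key (database + id) as a uniqueness constraint.
            unique_together = ('library', 'mdb_id')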

So "all of this works already". Kind of.

    class Library(models.Model):
        name = models.CharField(max_length=255, unique=True)

    class Option(models.Model):
        # The first param to MDB_Manager is a custom pyodbc class that does
        # simple stuff like 'select all rows', etc.
        objects = MDB_Manager('Library', 'OptionLabels', {
            'lib_option_id': 'ID',
            'name': 'Name',
            'description': ('Description', lambda x: x or ''),  # coerce NULL to ''
            },
            'lib_option_id',   # local attr holding the MDB pk
            ('library_id', ),  # required attrs when syncing, to find the local row
                               # (the other half of the composite key)
            lambda manager, extradict: (Library.objects.get(pk=extradict['library_id']).name, ),
        )
        library = models.ForeignKey(Library)
        lib_option_id = models.IntegerField()
        name = models.CharField(max_length=255)
        description = models.CharField(max_length=255)
        new_attr_not_in_mdb = models.IntegerField(default=0)

This lets me sync tables like this: Option.objects.update_from_mdb({'library_id': 1}), which works.
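Since there are potentially several source databases, the nightly job is just a loop over them (assuming one Library row exists per source .mdb):

    # Nightly sync: pull each source database into the local tables.
    for lib in Library.objects.all():
        Option.objects.update_from_mdb({'library_id': lib.pk})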

However, this gets a bit... hairy. Quickly.

But even so, it is a solution. Except that updating from the MDB table looks kind of like this:

    for row in k.table[self.mdb_table].select():  # k is the custom pyodbc wrapper; .select() is a generator
        i = self.model()  # instantiate a blank local instance
        try:
            # Try to find the existing local row. searchdict holds the composite
            # key ('library' and 'lib_option_id' in the example above).
            i = self.get(**searchdict)
        except self.model.DoesNotExist:
            pass
        # At this point, i holds the local data (existing or freshly created).
        # A loop uses setattr() to copy the mapped MDB values onto i.
        i.save()  # save the updated data

And again, this is fine, and it works... except that it is slow as hell, because each .get() on the composite key is a separate local DB hit, which is kind of a pain when dealing with 100,000 rows.
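The only speedup I've thought of so far is prefetching the existing local rows into a dict keyed by the composite key, so the lookup happens in memory instead of as one query per row. A rough, untested sketch (library_id here stands for the value pulled from the extradict), though I'm not sure it's the right direction:

    # One query up front instead of one .get() per MDB row.
    existing = dict(
        ((o.library_id, o.lib_option_id), o)
        for o in self.filter(library_id=library_id)
    )
    for row in k.table[self.mdb_table].select():
        i = existing.get((library_id, row.ID)) or self.model()
        # ... setattr() the mapped values as before ...
        i.save()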

So, my question is twofold:

  1. How would you do this?

  2. Any way I can speed this up, if this is 'the best way'?

Stipulations / Constraints

  • The Access DB will never be modified by Django

  • django-pyodbc seems to work only with SQL Server, not Access's 'friendly' dialect of SQL

  • A lag time of a day or so between sync from MDB to Django is acceptable

  • Reading from the DB directly would be fine if I could use Django's ORM

    • Except that thing about the composite keys...

Thanks for the help :)