Often when I am working on a project, I find myself looking at the database schema and having to export the data to work with the new schema.

Lots of times the database I'm handed stores fairly crude data; what I mean by that is that it's stored with lots of unfiltered characters. I find myself writing custom PHP scripts to filter through this information and create a clean UTF-8 CSV file that I then reimport into my new database.

I'd like to know if there are better ways to handle this.
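For reference, a stripped-down sketch of the kind of cleanup script I mean (shown in Python rather than PHP for brevity; the file names, delimiter, and filter rule are made up):

import csv
import re

def clean_field(raw):
    # decode permissively, then drop ASCII control characters
    text = raw.decode('utf-8', errors='replace')
    return re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', text)

with open('raw_dump.txt', 'rb') as src, \
     open('clean_utf8.csv', 'w', encoding='utf-8', newline='') as dst:
    writer = csv.writer(dst)
    for line in src:
        fields = line.rstrip(b'\r\n').split(b';')  # made-up delimiter
        writer.writerow(clean_field(f) for f in fields)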

A: 

No one answer to this one, but I once needed to quickly migrate a database and ended up using sqlautocode, a tool that autogenerates a Python ORM model from an existing database. The model uses the great SQLAlchemy ORM library, and sqlautocode even generates some sample code to get you started (see below).

Amazingly, it worked out of the box. You do not get a full migration, but you do get an easy way to programmatically access all your tables (in Python).

I didn't do it on that project, but you could of course autogenerate your ORM layer for the target DB as well, then write a script that transfers the right rows over into the desired structure; a rough sketch of that idea follows the example code below.

Once you get your DB content into Python, you will be able to deal with the Unicode issues, even if it takes a few attempts, depending on the actual crudeness of the data ...

Example code:

# some example usage -- `metadata` and `customers` come from the
# module that sqlautocode generated (the import names below are assumptions)
from sqlalchemy import create_engine
from model import *

if __name__ == '__main__':
    db = create_engine(u'mysql://username:password@localhost/dbname')
    metadata.bind = db

    # fetch first 10 rows from customers
    s = customers.select().limit(10)
    rs = s.execute()
    for row in rs:
        print row
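And a rough sketch of the transfer idea above (untested; the connection URLs, table names, and column cleanups are placeholders, and it uses SQLAlchemy's schema reflection rather than sqlautocode's generated model):

from sqlalchemy import create_engine, MetaData, Table

src_engine = create_engine('mysql://user:password@localhost/old_db')
dst_engine = create_engine('mysql://user:password@localhost/new_db')

src_meta, dst_meta = MetaData(), MetaData()
src_customers = Table('customers', src_meta, autoload_with=src_engine)
dst_customers = Table('customers', dst_meta, autoload_with=dst_engine)

with src_engine.connect() as src, dst_engine.begin() as dst:
    for row in src.execute(src_customers.select()):
        # map/clean each source row into the target structure here
        dst.execute(dst_customers.insert().values(
            name=row.name.strip(),      # hypothetical columns and cleanup
            email=row.email.lower(),
        ))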
The MYYN
A: 

I would suggest using an ETL tool, or at least following ETL practices when moving data. Considering that you are already cleaning, you could follow the whole ECCD path -- extract, clean, conform, deliver. If you do your own cleaning, consider saving the intermediate CSV files for debugging and auditing purposes, as in the sketch after the list below.

1. Extract (as is, junk included) to file_1
2. Clean file_1 --> file_2
3. Conform file_2 --> file_3
4. Deliver file_3 --> DB tables

If you archive files 1-3 and document versions of your scripts, you will be able to backtrack in case of a bug.
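A minimal sketch of that flow (in Python; the file names match the steps above, but the clean and conform rules and the two-column layout are examples only):

import csv

def extract(db_rows, path):
    # 1. dump the source rows as-is, junk included
    with open(path, 'w', encoding='utf-8', newline='') as f:
        csv.writer(f).writerows(db_rows)

def clean(src, dst):
    # 2. strip junk characters and stray whitespace
    with open(src, encoding='utf-8', newline='') as i, \
         open(dst, 'w', encoding='utf-8', newline='') as o:
        w = csv.writer(o)
        for row in csv.reader(i):
            w.writerow(c.replace('\x00', '').strip() for c in row)

def conform(src, dst):
    # 3. map the cleaned columns onto the target schema
    with open(src, encoding='utf-8', newline='') as i, \
         open(dst, 'w', encoding='utf-8', newline='') as o:
        w = csv.writer(o)
        for name, email in csv.reader(i):  # example two-column mapping
            w.writerow([name.title(), email.lower()])

extract([('  ann ', 'ANN@EXAMPLE.COM')], 'file_1.csv')
clean('file_1.csv', 'file_2.csv')
conform('file_2.csv', 'file_3.csv')
# 4. deliver: bulk-load file_3.csv into the target tables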

ETL tools -- like Microsoft SSIS, Oracle Data Integrator, or Pentaho Data Integration -- connect to various data sources and offer plenty of transformation and profiling tasks.

Damir Sudarevic