I will be writing a little Python script tomorrow to retrieve all the data from an old MS Access database into a CSV file first, and then, after some data cleansing and munging, I will import the data into a MySQL database on Linux.

I intend to use pyodbc to make a connection to the MS Access db. I will be running the initial script in a Windows environment.

The db has IIRC well over half a million rows of data. My questions are:

  1. Is the number of records a cause for concern (i.e., will I hit any limits)?
  2. Is there a better file format for the transitory data (instead of CSV)?

I chose CSV because it is quite simple and straightforward (and I am a Python newbie), but I would like to hear from someone who has done something similar before.
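
For reference, a minimal sketch of the pyodbc-to-CSV step might look like the following; the driver string, file path, and table name are placeholders rather than details from the actual database:

import csv
import pyodbc

# Connect to the Access file through the Access ODBC driver.
conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\legacy.mdb;"
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM SomeTable")

with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # header row
    for row in cursor:  # the cursor yields rows one at a time
        writer.writerow(row)

conn.close()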

A: 

The only limit should be the operating system's maximum file size.

That said, when you send the data to the new database, make sure you write it a few records at a time; I've seen people try to load the entire file into memory first and then write it all at once.
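
A minimal sketch of that batching pattern, assuming a CSV file with a header row, a hypothetical table mytable with three fields, and the MySQLdb driver (any DB-API driver works the same way):

import csv
import MySQLdb  # assumption: substitute whichever MySQL driver you use

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="target")
cursor = conn.cursor()
insert_sql = "INSERT INTO mytable (field1, field2, field3) VALUES (%s, %s, %s)"

with open("export.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= 1000:  # write a chunk at a time, never the whole file
            cursor.executemany(insert_sql, batch)
            batch = []
    if batch:  # flush the final partial chunk
        cursor.executemany(insert_sql, batch)

conn.commit()
conn.close()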

Charlie Martin
+1  A: 

I wouldn't bother using an intermediate format. Pulling from Access via ADO and inserting right into MySQL really shouldn't be an issue.
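
A rough sketch of that direct route, where every connection detail, table, and field name is an assumption for illustration:

import pyodbc
import MySQLdb

src = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\data\legacy.mdb;"
)
dst = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="target")

read = src.cursor()
write = dst.cursor()
read.execute("SELECT field1, field2, field3 FROM AccessTable")

while True:
    rows = read.fetchmany(1000)  # pull a modest batch from Access
    if not rows:
        break
    write.executemany(
        "INSERT INTO mytable (field1, field2, field3) VALUES (%s, %s, %s)",
        [tuple(r) for r in rows],
    )

dst.commit()
src.close()
dst.close()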

Ignacio Vazquez-Abrams
oh yeah, doing "some data cleansing, munging" on the fly, no worries, it'll "work" first time. **FAIL**
John Machin
If it fails directly, then it would have failed with the intermediary regardless.
Ignacio Vazquez-Abrams
The point is that the multiple attempts at fixing the problems would be better handled with CSV files than in the Access database.
John Machin
@John: I understand that it's accepted doctrine to do so, and I would have said the same a few years ago, but I can't really think of any specific reason why in this case.
Ignacio Vazquez-Abrams
+3  A: 

Yet another approach if you have Access available ...

Create a table in MySQL to hold the data.

In your Access db, create an ODBC link to the MySQL table.

Then execute a query such as:

INSERT INTO MySqlTable (field1, field2, field3)
SELECT field1, field2, field3
FROM AccessTable;

Note: This suggestion presumes you can do your data cleaning operations in Access before sending the data on to MySQL.

HansUp
At worst, you can load the data into a staging table and then cleanse it in a separate pass into another table once it's all in MySQL.
TokenMacGuy
+3  A: 

Memory usage for csv.reader and csv.writer isn't proportional to the number of records, as long as you iterate correctly and don't try to load the whole file into memory; that's one reason the iterator protocol exists. Similarly, csv.writer writes directly to disk rather than buffering everything, so it isn't limited by available memory either. You can process any number of records this way.
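
For instance, a streaming cleanup pass keeps memory use flat regardless of file size; the strip() call here is just a stand-in for whatever munging the data actually needs:

import csv

with open("export.csv", newline="") as src, \
        open("clean.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:  # rows stream through one at a time
        writer.writerow([field.strip() for field in row])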

For simple data structures, CSV is fine. It's much easier to get fast, incremental access to CSV than to more complicated formats like XML (tip: pulldom is painfully slow).

Glenn Maynard