Hi,

I have a Python script which uses the MySQLdb interface to load various CSV files into MySQL tables.

In my code, I use Python's standard csv library to read the CSV, then insert each record into the table one at a time using an INSERT query. I do this rather than using LOAD DATA so that I can convert null values and perform other minor clean-ups on a per-field basis.
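As a sketch of that per-field clean-up (the `clean_field` helper and the set of placeholder strings are hypothetical, just for illustration):

```python
import csv
import io

def clean_field(raw):
    """Convert common CSV placeholders for missing data to Python None,
    which MySQLdb sends to MySQL as SQL NULL."""
    stripped = raw.strip()
    if stripped in ("", "NULL", "\\N"):
        return None
    return stripped

# Example: clean each field of each CSV row before building the INSERT.
sample = io.StringIO("102,1,2010-01-01,63\n102,2,2010-01-02,NULL\n")
rows = [[clean_field(f) for f in row] for row in csv.reader(sample)]
# rows[1] is ['102', '2', '2010-01-02', None]
```

Each cleaned row can then go into a parameterised INSERT, e.g. `cursor.execute("INSERT INTO mytable VALUES (%s, %s, %s, %s)", row)`, since MySQLdb maps Python `None` to SQL NULL.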

Example table format:

`id_number` | `iteration` | `date`     | `value`
102         | 1           | 2010-01-01 | 63
102         | 2           | 2010-01-02 | NULL
102         | 3           | 2010-01-03 | 65

The NULL in the second iteration for id_number = 102 represents a case where `value` hasn't changed from the previous day, i.e. it remains 63.

Basically, I need to convert these null values to their correct values. I can imagine 4 ways of doing this:

  1. Once everything is inserted into the table, run a MySQL query that does the iterating and replacing all by itself.

  2. Once everything is inserted into the table, run a MySQL query to send some data back to Python, process in Python then run a MySQL query to update the correct values.

  3. Do the processing in Python on a per-field basis before each insert.

  4. Insert into a temporary table and use SQL to insert into the main table.

I could probably work out how to do #2, and maybe #3, but have no idea how to do #1 or #4, which I think are the best methods as it then requires no fundamental changes to the Python code.

My questions are: A) which of the above methods is "best" and "cleanest" (speed is not really an issue), and B) how would I achieve #1 or #4?
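For B), one way to sketch #1 is a single UPDATE with a correlated subquery that looks back for the most recent non-NULL value in the same id_number. The demo below uses Python's built-in sqlite3 so it runs self-contained; the same idea should carry over to MySQL, but as far as I know MySQL refuses to read from the table being updated, so there the inner SELECT would have to go through a derived table (e.g. `FROM (SELECT * FROM t) AS t2`). Table and column names are taken from the example above.

```python
import sqlite3

# Build an in-memory table matching the example data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id_number INTEGER, iteration INTEGER, date TEXT, value INTEGER)"
)
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", [
    (102, 1, "2010-01-01", 63),
    (102, 2, "2010-01-02", None),
    (102, 3, "2010-01-03", 65),
])

# Fill each NULL with the most recent non-NULL value for the same id_number.
conn.execute("""
    UPDATE t SET value = (
        SELECT t2.value FROM t AS t2
        WHERE t2.id_number = t.id_number
          AND t2.iteration < t.iteration
          AND t2.value IS NOT NULL
        ORDER BY t2.iteration DESC
        LIMIT 1
    )
    WHERE value IS NULL
""")

filled = conn.execute("SELECT value FROM t ORDER BY iteration").fetchall()
# filled == [(63,), (63,), (65,)]
```

Option #4 would be the same query in two steps: load the raw rows into a temporary staging table, run this UPDATE against the staging table, then `INSERT ... SELECT` the cleaned rows into the main table.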

Thanks in advance :)

+2  A: 

I think you would have the most control and the least amount of work with your #3 option, especially if you want to keep existing values rather than overwrite them with nulls, which I think is a risk with #1.

If speed is not an issue, for every record in your CSV, compare it to the existing record, and update or insert your record with your preferred values.
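For what it's worth, #3 need not restructure the existing loop much: a small generator can wrap the CSV rows and remember the last non-null value seen per id_number. This is only a sketch; the column positions and placeholder strings are taken from the example table and would need adjusting, and it assumes each file's rows arrive in iteration order.

```python
def forward_fill(rows, id_col=0, value_col=3):
    """Yield rows with NULL/empty values replaced by the last
    non-null value seen for the same id_number."""
    last = {}  # id_number -> most recent non-null value
    for row in rows:
        row = list(row)
        if row[value_col] in (None, "", "NULL"):
            row[value_col] = last.get(row[id_col])
        else:
            last[row[id_col]] = row[value_col]
        yield row

rows = [
    ["102", "1", "2010-01-01", "63"],
    ["102", "2", "2010-01-02", "NULL"],
    ["102", "3", "2010-01-03", "65"],
]
filled = list(forward_fill(rows))
# filled[1][3] == "63"
```

The existing insert loop then iterates over `forward_fill(csv.reader(f))` instead of `csv.reader(f)` directly.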

andyortlieb
I started writing this before Mark's hint, which I have to say is a great suggestion, but I still think this is how you can get it done quickly with the tools that you already know.
andyortlieb
Thanks for the input. That was my original thought, as I'd know how to code that. The thing is, I currently use one definition for inserting dozens of CSV files into the database every day. With #3, I have to change the iteration to "remember" the previous iteration for comparison, which could cause problems with the existing CSV files. This may be the way I have to do it, unless someone can show me a magical SQL query that does it all-in-one!
edanfalls