I want to load a CSV file that looks like this:

Acct. No.,1-15 Days,16-30 Days,31-60 Days,61-90 Days,91-120 Days,Beyond 120 Days
2314134101,898.89,8372.16,5584.23,7744.41,9846.54,2896.25
2414134128,5457.61,7488.26,9594.02,6234.78,273.7,2356.13
2513918869,2059.59,7578.59,9395.51,7159.15,5827.48,3041.62
1687950783,4846.85,8364.22,9892.55,7213.45,8815.33,7603.4
2764856043,5250.11,9946.49,8042.03,6058.64,9194.78,8296.2
2865446086,596.22,7670.04,8564.08,3263.85,9662.46,7027.22
,4725.99,1336.24,9356.03,1572.81,4942.11,6088.94
,8248.47,956.81,8713.06,2589.14,5316.68,1543.67
,538.22,1473.91,3292.09,6843.89,2687.07,9808.05
,9885.85,2730.72,6876,8024.47,1196.87,1655.29

As you can see, some of the fields are incomplete. I expected MySQL to simply skip any row whose first column is missing. When I run the command:

LOAD DATA LOCAL INFILE 'test-long.csv' REPLACE INTO TABLE accounts
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\r\n'
    IGNORE 1 LINES
    (cf_535, cf_580, cf_568, cf_569, cf_571, cf_572);

And the MySQL output is:

Query OK, 41898 rows affected, 20948 warnings (0.78 sec)
Records: 20949  Deleted: 20949  Skipped: 0  Warnings: 20948

The file has only 20,949 lines, but MySQL reports 41,898 rows affected. Why is that? Also, nothing actually changed in the table, and I couldn't see what the generated warnings were about. I wanted to use LOAD DATA INFILE because Python takes about half a second to update each row, which translates to 2.77 hours for a file with 20,000+ records.
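
(For reference, the warnings can still be read right after the load, on the same connection that ran it. A minimal sketch using MySQLdb; cursor is assumed to be an open cursor on that connection:)

    # Fetch and print the warnings the load generated;
    # SHOW WARNINGS returns (Level, Code, Message) tuples.
    cursor.execute("SHOW WARNINGS")
    for level, code, message in cursor.fetchall():
        print level, code, message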

UPDATE: I modified the code to turn auto-commit off and added a db.commit() call:

# Tell MySQLdb to turn off auto-commit
db.autocommit(False) 

# Start at 1 to skip the header row
count = 1
while count < len(contents):
    row = contents[count]
    if row[0] != '':
        # Pass the values as parameters so MySQLdb quotes/escapes them,
        # instead of splicing them into the SQL string with %
        cursor.execute("""
            UPDATE accounts SET cf_580 = %s, cf_568 = %s, cf_569 = %s, cf_571 = %s, cf_572 = %s
            WHERE cf_535 = %s""",
            (row[1], row[2], row[3], row[4], row[5], row[0]))
    count += 1

try:
    db.commit()
except Exception:
    # Roll back the whole batch if the commit fails
    db.rollback()
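
(The same batch can also go through a single executemany call. With MySQLdb this is mostly a readability change, since the driver still issues one UPDATE per row internally, but it keeps the whole batch in one transaction with one commit. A sketch, assuming contents is the parsed CSV with the header at index 0:)

    # Build the parameter tuples, skipping the header and blank account numbers
    params = [(row[1], row[2], row[3], row[4], row[5], row[0])
              for row in contents[1:] if row[0] != '']
    cursor.executemany(
        """UPDATE accounts SET cf_580 = %s, cf_568 = %s, cf_569 = %s,
                               cf_571 = %s, cf_572 = %s
           WHERE cf_535 = %s""",
        params)
    db.commit()
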
+1  A: 

You have basically three issues here, in reverse order:

  1. Are you doing your Python inserts as individual statements? You probably want to wrap them all in a single begin transaction/commit; 20,000 separate commits could easily take hours.
  2. Your import statement names 6 columns, but the CSV has 7 fields. That would explain the doubled row count: every line of input produces 2 rows in the database, the second with fields 2-6 NULL. (A corrected column list is sketched after this list.)
  3. Incomplete rows will be inserted with NULL or default values for the missing columns. That is probably not what you want for those malformed rows.
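
(A sketch of what the corrected load might look like, run through MySQLdb. cf_573 is a hypothetical name for the seventh column; substitute whichever column actually holds the missing aging bucket. LOCAL also requires local_infile=1 on the connection:)

    # The column list now covers all 7 CSV fields. The doubled backslashes
    # keep the literal \r\n escape sequence in the SQL sent to the server.
    cursor.execute("""
        LOAD DATA LOCAL INFILE 'test-long.csv' REPLACE INTO TABLE accounts
            FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\r\\n'
            IGNORE 1 LINES
            (cf_535, cf_580, cf_568, cf_569, cf_571, cf_572, cf_573)""")
    db.commit()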

If your Python program still can't perform fast enough even with a single transaction, you should at least have it edit/clean the data file before importing. If Acct. No. is the primary key, as seems reasonable, inserting rows with a blank key will either make the whole import fail or, if auto-number is on, load bogus data. A cleaning pass is sketched below.
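
(A minimal cleaning pass; the file names are assumptions:)

    import csv

    # Copy the CSV, keeping the header but dropping any row whose
    # account number (first field) is blank.
    src = open('test-long.csv', 'rb')
    dst = open('test-clean.csv', 'wb')
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))      # header line
    for row in reader:
        if row and row[0].strip():
            writer.writerow(row)
    src.close()
    dst.close()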

Lucky
1. Yes, I'm doing the Python inserts as individual statements, since the table is MyISAM and doesn't support transactions. 2. Thank you for pointing this out; I had overlooked it. I should explain the end result I want: update the DB by first matching on the account number, then updating that account's values, and do this on a regular basis. From the looks of it, though, LOAD DATA INFILE is good for initial imports rather than for regular updating.
Francis
Converted the MyISAM table to InnoDB (I had forgotten that InnoDB was skipped/not loaded on my local development machine).
Francis
A: 

If you use the REPLACE keyword in LOAD DATA, the number after "Deleted:" shows how many rows were actually replaced. A replaced row counts twice in the affected-row total (one delete plus one insert), which is exactly why your 20,949 records show up as 20,949 + 20,949 = 41,898 rows affected.

noonex