views:

90

answers:

2

Hi folks,

I'm using Python to save the data row by row... but this is extremely slow!

The CSV contains 70 million lines, and with my script I can only store about 1,000 rows per second.


This is what my script looks like:

import csv

reader = csv.reader(open('test_results.csv', 'r'))
for row in reader:
    # Each save() issues its own INSERT per row.
    TestResult(type=row[0], name=row[1], result=row[2]).save()

I reckon that for testing I might have to consider MySQL or PostgreSQL.

Any ideas or tips? This is the first time I've dealt with such a massive volume of data. :)

+3  A: 

For MySQL imports:

mysqlimport [options] db_name textfile1 [textfile2 ...]
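
For the file in the question, an invocation might look roughly like the sketch below. This assumes the target database already contains a test_results table whose columns match the CSV (mysqlimport derives the table name from the file name), and USER and db_name are placeholders:

mysqlimport --local --user=USER --password --fields-terminated-by=',' db_name test_results.csv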

For SQLite3 imports:

See http://stackoverflow.com/questions/1045910/how-can-i-import-load-a-sql-or-csv-file-into-sqlite
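
As a rough sketch of the approach described there, assuming a test.db database that already has a test_results table with matching columns, the sqlite3 shell can load the file directly with .import (note that a plain separator-based import does not handle quoted fields):

sqlite3 test.db
sqlite> .separator ","
sqlite> .import test_results.csv test_results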

Andrew Sledge
+1  A: 

I don't know if this will make a big enough difference, but since you're dealing with the Django ORM I can suggest the following:

  1. Ensure that DEBUG is False in your Django settings file, since otherwise you're storing every single query in memory.
  2. Put your logic in a main function, and wrap that in the django.db.transaction.commit_on_success decorator (see the sketch below this list). That will prevent each row from needing its own transaction, which will substantially speed up the process.
  3. If you know that none of the rows in the file already exist in the database, add force_insert=True to your call to the save() method. This will halve the number of calls to SQLite needed.
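
A minimal sketch combining suggestions 2 and 3, assuming TestResult lives in myapp.models (adjust the import to your project):

import csv

from django.db import transaction
from myapp.models import TestResult  # assumed app path; adjust to your project

@transaction.commit_on_success
def import_results(path):
    # All rows are committed in a single transaction instead of one per row.
    reader = csv.reader(open(path, 'r'))
    for row in reader:
        # force_insert=True skips the check for an existing row before inserting.
        TestResult(type=row[0], name=row[1], result=row[2]).save(force_insert=True)

import_results('test_results.csv')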

These suggestions will probably make an even bigger difference if you do find yourself using a client-server DBMS.

Peter Russell