tags:
views: 103
answers: 4

What would be the best way to import multi-million record CSV files into Django?

Currently, using the Python csv module, it takes 2-4 days to process a 1 million record file. It does some checking of whether the record already exists, and a few other things.

Can this process be made to run in a few hours?

Can memcached be used somehow?

Update: There are Django ManyToManyField fields that get processed as well. How would these be handled with a direct load?

A: 

I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.
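
A minimal sketch of what using the driver directly could look like with MySQLdb (the table name, columns, and credentials are placeholders, and the duplicate-checking from the question is left out):

    import csv
    import MySQLdb

    # Connect with the same credentials Django uses (see settings.DATABASES);
    # the values here are placeholders.
    conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="app")
    cur = conn.cursor()

    INSERT = "INSERT INTO myapp_record (code, name) VALUES (%s, %s)"

    with open("records.csv") as f:
        batch = []
        for row in csv.reader(f):
            batch.append((row[0], row[1]))
            if len(batch) >= 1000:
                cur.executemany(INSERT, batch)  # one round trip for the whole batch
                conn.commit()
                batch = []
        if batch:
            cur.executemany(INSERT, batch)
            conn.commit()

Batching rows through executemany() avoids a round trip per row, which is usually where most of the time goes.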

Shane Reustle
+1  A: 

I'm not sure about your case, but we had a similar scenario with Django where ~30 million records took more than a day to import.

Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts in Python we made a radical strategy change and did the import (only the import) with Java and JDBC (plus some MySQL tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).

A. Ionescu
Can you point to some web resources for Java and JDBC?
bobsr
Well, there are just too many good books and resources about both. For example, there's the free tutorial at http://java.sun.com/developer/Books/JDBCTutorial/ . Right now I have http://apress.com/book/view/9781590595206 and http://oreilly.com/catalog/9780596005221/ here near my keyboard for JDBC. For the actual project we're using http://supercsv.sourceforge.net/ because we need to do some more transformations first, and it's simpler for us if the CSV is read as objects instead of strings before pushing it to the database.
A. Ionescu
A: 

Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. For example, you can write a Perl/Python script to read the CSV file and create a SQL script with INSERT statements, and then feed the SQL script directly to MySQL.
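
A rough sketch of such a script in Python (table and column names are made up, and a production version should use the driver's own escaping rather than the bare-bones esc() helper here):

    import csv

    TABLE = "myapp_record"     # hypothetical table behind the Django model
    COLUMNS = "(code, name)"

    def esc(value):
        # Minimal escaping for the sketch only; prefer the driver's escaping
        # or LOAD DATA in real use.
        return "'%s'" % value.replace("\\", "\\\\").replace("'", "\\'")

    with open("records.csv") as src, open("records.sql", "w") as dst:
        rows = []
        for row in csv.reader(src):
            rows.append("(%s, %s)" % (esc(row[0]), esc(row[1])))
            if len(rows) >= 500:
                # Multi-row INSERTs are far faster than one statement per row.
                dst.write("INSERT INTO %s %s VALUES\n%s;\n"
                          % (TABLE, COLUMNS, ",\n".join(rows)))
                rows = []
        if rows:
            dst.write("INSERT INTO %s %s VALUES\n%s;\n"
                      % (TABLE, COLUMNS, ",\n".join(rows)))

The generated file can then be fed to the client with something like mysql mydb < records.sql.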

As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
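
For MySQL, dropping and re-adding a secondary index around the load might look like this (the table and index names are illustrative):

    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="app", passwd="secret", db="app")
    cur = conn.cursor()

    # Drop the secondary index so each INSERT doesn't have to maintain it...
    cur.execute("ALTER TABLE myapp_record DROP INDEX idx_record_name")

    # ... run the bulk load here ...

    # ... then rebuild the index once, after all the rows are in.
    cur.execute("ALTER TABLE myapp_record ADD INDEX idx_record_name (name)")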

If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
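
If you stay inside Django, newer versions provide transaction.atomic() and bulk_create(), which make it straightforward to keep each transaction to a bounded chunk; a sketch with a hypothetical Record model:

    from django.db import transaction
    from myapp.models import Record  # hypothetical model backing the CSV rows

    def import_rows(rows, chunk_size=1000):
        # Roughly 1,000 rows per transaction, as suggested above.
        for start in range(0, len(rows), chunk_size):
            chunk = rows[start:start + chunk_size]
            with transaction.atomic():
                Record.objects.bulk_create(
                    [Record(code=code, name=name) for code, name in chunk]
                )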

And most importantly, BACK UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.

Craig Trader
A: 

As mentioned, you want to bypass the ORM and go directly to the database. Depending on what type of database you're using, you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high-speed data loading, and for MySQL you can use the LOAD DATA INFILE command. I'm sure there's something similar for Postgres as well.
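
A sketch of driving MySQL's LOAD DATA from Python with MySQLdb (the file, table, and column names are placeholders; LOCAL needs the local_infile client option, and whether it is allowed also depends on server configuration):

    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="app", passwd="secret",
                           db="app", local_infile=1)
    cur = conn.cursor()
    cur.execute("""
        LOAD DATA LOCAL INFILE 'records.csv'
        INTO TABLE myapp_record
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        LINES TERMINATED BY '\\n'
        (code, name)
    """)
    conn.commit()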

Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into MySQL running on a very low-end machine in minutes using mysqldump.

Parand
Can the MySQL LOAD command work with Django ManyToManyField fields?
bobsr