Hello,

I have a database table with a unique string field and a couple of integer fields. The string field is usually 10-100 characters long.

Once every minute or so I have the following scenario: I receive a list of 2-10 thousand tuples corresponding to the table's record structure, e.g.

[("hello", 3, 4), ("cat", 5, 3), ...]

I need to insert all these tuples into the table (assume I have verified that none of these strings already appear in the database). For clarification: I'm using InnoDB, and I have an auto-increment primary key for this table; the string is not the PK.

My code currently iterates through this list; for each tuple it creates a Python model object with the appropriate values and calls .save(), something like so:

from django.db import transaction

@transaction.commit_on_success
def save_data_elements(input_list):
    for (s, i1, i2) in input_list:
        entry = DataElement(string=s, number1=i1, number2=i2)
        entry.save()

This code is currently one of the performance bottlenecks in my system, so I'm looking for ways to optimize it.

For example, I could generate SQL statements, each containing an INSERT command for 100 tuples (with the values "hard-coded" into the SQL), and execute them, but I don't know whether that would improve anything.

Do you have any suggestions for optimizing such a process?

Thanks

+1  A: 

What format do you receive the data in? If it is a file, you can do some sort of bulk load: http://www.classes.cs.uchicago.edu/archive/2005/fall/23500-1/mysql-load.html

KM
+10  A: 

For MySQL specifically, the fastest way to load data is using LOAD DATA INFILE, so if you can convert the data into the format it expects, that will probably be the fastest way to get it into the table.
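A rough sketch of that conversion (the file path is an assumption; tab-separated columns and newline-terminated rows are simply the LOAD DATA INFILE defaults):

# Write one tab-separated line per row - the default format LOAD DATA INFILE expects.
# Note: embedded tabs, newlines or backslashes in the strings would need escaping.
with open('/tmp/data.txt', 'w') as f:
    for (s, i1, i2) in input_list:
        f.write('%s\t%d\t%d\n' % (s, i1, i2))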

Chad Birch
The only potential issue is overriding the save() method. If you do this, you'll have to think twice about your design.
S.Lott
@S.Lott: what do you mean by "overriding the save()"? Do you mean whether I override the .save() method in the model class, so that there are pre/post-processing tasks that take place when saving through the code but would be lost with "load data infile"? If so - that's not the case, I'm not overriding .save(). Otherwise please elaborate... Thanks
Roee Adler
+8  A: 

You can write the rows to a file in the format "field1","field2",... and then use LOAD DATA INFILE to load them:

# Note: fields containing '"', ',' or '\' would need escaping first.
lines = '\n'.join(','.join('"%s"' % field for field in row) for row in data)
with open('data.txt', 'w') as f:
    f.write(lines)

Then execute this:

LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table FIELDS TERMINATED BY ',' ENCLOSED BY '"';
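If you are issuing that statement from Django rather than the mysql client, one way is through the raw connection cursor (a sketch; the FIELDS clause matches the quoted, comma-separated file written above):

from django.db import connection

cursor = connection.cursor()
cursor.execute("LOAD DATA INFILE 'data.txt' INTO TABLE db2.my_table "
               "FIELDS TERMINATED BY ',' ENCLOSED BY '\"'")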

Reference

Nadia Alramli
It would need to be LOAD DATA LOCAL INFILE unless the code is running on the database server.
staticsan
Also, disable indexes before loading and re-enable them afterwards (rebuilding the index will take a while). I haven't checked whether this helps with Django inserts too.
pufferfish
+3  A: 

If you don't use LOAD DATA INFILE, as some of the other suggestions mention, two things you can do to speed up your inserts are:

  1. Use prepared statements - this cuts out the overhead of parsing the SQL for every insert
  2. Do all of your inserts in a single transaction - this requires a DB engine that supports transactions (like InnoDB); see the sketch after this list
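A minimal sketch combining both ideas, assuming the question's model maps to a MySQL table named data_element (name assumed). MySQLdb doesn't expose true server-side prepared statements, so the closest equivalent from Django is a single parameterized statement executed over the whole list, wrapped in one transaction:

from django.db import connection, transaction

@transaction.commit_on_success
def insert_all(rows):
    cursor = connection.cursor()
    # One parameterized INSERT for every row; the %s placeholders are
    # filled in (and escaped) by the driver, not by string formatting.
    cursor.executemany(
        'INSERT INTO data_element (string, number1, number2) '
        'VALUES (%s, %s, %s)', rows)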
Sean McSomething
@Sean: Thanks. By "prepared statements", do you mean SQL code with many %s placeholders that I just "fill in" by providing the list of strings/numbers? Also, please take a look at my code (in the body of the question) - if I understand correctly, I'm already using a single transaction via the @transaction.commit_on_success decorator (and I'm using InnoDB).
Roee Adler
I'm not really sure what's going on behind the scenes with Django - I'm just coming from a generic background of using MySQL, so I don't know what it's doing regarding transactions. As for prepared statements - that looks like an implementation detail of your DataElement objects. A prepared statement would be: `stmt = Prepare(sqlStatement); stmt.execute(var1, var2..)` rather than `db.execute(sqlStatement, var1, var2...)` - it's like compiling regular expressions rather than parsing them every time.
Sean McSomething
+2  A: 

If you can do a hand-rolled INSERT statement, then that's the way I'd go. A single INSERT statement with multiple VALUES clauses is much, much faster than lots of individual INSERT statements.
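A rough sketch of building such a statement in batches, assuming a table named data_element (name assumed from the question); around 500 rows per statement keeps each packet comfortably under the default 1MB limit discussed below:

from django.db import connection, transaction

@transaction.commit_on_success
def bulk_insert(rows, batch_size=500):
    cursor = connection.cursor()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One INSERT with many value groups; %s placeholders let the
        # driver handle quoting/escaping instead of hard-coding values.
        values = ', '.join(['(%s, %s, %s)'] * len(batch))
        sql = ('INSERT INTO data_element (string, number1, number2) '
               'VALUES ' + values)
        cursor.execute(sql, [v for row in batch for v in row])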

staticsan
@staticsan: Do you think there's any "practical" limitation to such a statement? i.e. can I send the database a single INSERT query with 10k lines of text?
Roee Adler
The only real limitation is the size of the network buffer (max_allowed_packet). The default value was 1MB for many years, but many people raised it to the maximum of 16MB. More recent versions of MySQL can support even larger packet sizes.
staticsan
A: 

This is unrelated to the actual load of data into the DB, but...

If providing a "The data is loading... The load will be done shortly" type of message to the user is an option, then you can run the INSERTs or LOAD DATA asynchronously in a different thread.

Just something else to consider.
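A minimal sketch of that hand-off, assuming the save_data_elements function from the question and Python 2's Queue module (queue size is a placeholder):

import threading
import Queue

work_queue = Queue.Queue(maxsize=10)

def worker():
    while True:
        batch = work_queue.get()
        save_data_elements(batch)  # the function from the question
        work_queue.task_done()

t = threading.Thread(target=worker)
t.setDaemon(True)  # don't block interpreter shutdown
t.start()

# The receiving code then just enqueues each incoming list:
# work_queue.put(input_list)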

NathanD
More likely the problem is that the server is so busy processing this input that it cannot handle any other requests.
Tom Leys
I'm already doing the processing in a separate thread (the user is not waiting for this to finish); my problem is that at times the system is so busy that the queue may fill faster than it is emptied, for just long enough to cause trouble...
Roee Adler
+1  A: 

Regardless of the insert method, you will want to use the InnoDB engine for maximum read/write concurrency. MyISAM will lock the entire table for the duration of the insert whereas InnoDB (under most circumstances) will only lock the affected rows, allowing SELECT statements to proceed.

weevilgenius
Thanks, I added a clarification that I'm using InnoDB
Roee Adler
+1  A: 

I don't know the exact details, but you can use a JSON-style data representation and use it as fixtures or something similar. I saw something like this in the Django Video Workshop by Douglas Napoleone. See the videos at http://www.linux-magazine.com/online/news/django_video_workshop and http://www.linux-magazine.com/online/features/django_reloaded_workshop_part_1. Hope this one helps.

Hope you can work it out. I just started learning Django, so I can only point you to resources.
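For what it's worth, a fixture for the question's model could be generated roughly like this (the app label, model name, and whether the pk may be omitted all depend on your project and Django version - everything here is an assumption):

import json

# Each entry follows Django's fixture layout: a model label plus a "fields" dict.
fixture = [{'model': 'myapp.dataelement',
            'fields': {'string': s, 'number1': i1, 'number2': i2}}
           for (s, i1, i2) in input_list]

with open('data_fixture.json', 'w') as f:
    json.dump(fixture, f)

# Load it with: ./manage.py loaddata data_fixture.json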

roopesh