I've got an application that parses log files and inserts a huge amount of data into a database. It's written in Java and talks to a MySQL database over JDBC. I've experimented with different ways to insert the data to find the fastest for my particular use case. The one that currently seems to be the best performer is to issue an extended insert (i.e. a single INSERT with multiple rows), like this:

INSERT INTO the_table (col1, col2, ..., colN) VALUES
(v1, v2, v3, ..., vN),
(v1, v2, v3, ..., vN),
...,
(v1, v2, v3, ..., vN);

The number of rows can be tens of thousands.

I've tried using prepared statements, but it's nowhere near as fast, probably because each insert is still sent to the DB separately and the tables need to be locked and whatnot. My colleague who worked on the code before me tried using batching, but that didn't perform well enough either.

The problem is that using extended inserts means that, as far as I can tell, I need to build the SQL string myself (since the number of rows is variable), and that opens up all sorts of SQL injection vectors that I'm nowhere near clever enough to find myself. There's got to be a better way to do this.

Obviously I escape the strings I insert, but only with something like str.replace("\"", "\\\""); (repeated for ', ? and \), and I'm sure that isn't enough.
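
To make it concrete, here's roughly what my string building looks like at the moment. The table, columns and the escaping are simplified stand-ins for the real code, but the naive escape() is exactly the part I don't trust:

// Simplified sketch of the current approach: build one big INSERT by hand
// and escape values with plain string replacement.
private static String escape(String value) {
    return value.replace("\\", "\\\\")   // backslashes first, then quotes
                .replace("'", "\\'")
                .replace("\"", "\\\"");
}

private static String buildExtendedInsert(String[][] rows) {
    StringBuilder sql = new StringBuilder("INSERT INTO the_table (col1, col2) VALUES ");
    for (int i = 0; i < rows.length; i++) {
        if (i > 0) {
            sql.append(", ");
        }
        sql.append("('").append(escape(rows[i][0])).append("', '")
           .append(escape(rows[i][1])).append("')");
    }
    return sql.toString();
}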

+4  A: 

prepared statements + batch insert:

PreparedStatement stmt = con.prepareStatement(
"INSERT INTO employees VALUES (?, ?)");

stmt.setInt(1, 101);
stmt.setString(2, "Paolo Rossi");
stmt.addBatch();

stmt.setInt(1, 102);
stmt.setString(2, "Franco Bianchi");
stmt.addBatch();

// as many as you want   
stmt.executeBatch();
dfa
Would that issue one insert, or multiple? Or does the batching make the overhead of issuing many prepared statements go away?
Theo
batching should minimize the overhead of the prepared statement
dfa
+1 for actual code
Thorbjørn Ravn Andersen
Sorry, my colleague informed me that batching didn't solve the problem (I've edited the question to include this).
Theo
can you provide the actual code?
dfa
however this imho is the only viable way to avoid SQL injection AND provide decent performance
dfa
I'd be very surprised if batching didn't perform well enough. For large datasets you need to tune your batches a bit. For example, the batch can't be infinitely large, or you'll run out of memory. However, your batches need to be large enough to get the performance you need. Try increasing the batch size to larger values. But don't just take your colleague's word for it, try it for yourself.
skaffman
+1  A: 

I would try batching your inserts and see how that performs.

Have a read of this (http://www.onjava.com/pub/a/onjava/excerpt/javaentnut_2/index3.html?page=2) for more information on batching.

Nick Holt
Batching seems to be a solution, but according to my colleague who worked on the code before me it doesn't compare to using extended inserts (I've edited the question to include this).
Theo
Don't know what your colleague did when they tried to batch these inserts, but using batching isn't as simple as looping over all the data you want to insert, adding it all to one big batch and calling executeBatch. Here's a post from the MySQL mailing list (http://lists.mysql.com/maxdb/14236) regarding batching. In the case you've described you'd certainly want to call executeBatch more than once.
Nick Holt
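
A rough sketch of that kind of chunked loop; the table, the columns and the batch size of 1000 are just placeholders to experiment with:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Chunked batching: flush the batch every BATCH_SIZE rows so memory stays
// bounded while the batches are still big enough to cut down round trips.
public class ChunkedBatchInsert {

    private static final int BATCH_SIZE = 1000; // tune this for your data

    public static void insert(Connection con, String[][] rows) throws SQLException {
        PreparedStatement stmt = con.prepareStatement(
                "INSERT INTO the_table (col1, col2) VALUES (?, ?)");
        try {
            int count = 0;
            for (String[] row : rows) {
                stmt.setString(1, row[0]);
                stmt.setString(2, row[1]);
                stmt.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    stmt.executeBatch(); // flush this chunk to the server
                }
            }
            stmt.executeBatch(); // flush whatever is left over
        } finally {
            stmt.close();
        }
    }
}
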
Fair enough, it looks like batching is certainly worth trying again, but if what he says is true (and it seems he never committed the code, just tried it and discarded it) there's a huge difference in performance between extended inserts and batching. I guess it all comes down to where the bottleneck is: is it sending the data to the database that is slow, or is it the locking and all that inside the database that is the issue? If it's the latter, batching doesn't solve the problem; if it's the former, it may perform just as well and I get more security.
Theo
Suspect it'll probably be a bit of both (sending the data and the database itself). I'd play with different batch sizes and look at the table structure and its indexes.
Nick Holt
+1  A: 

If you are loading tens of thousands of records then you're probably better off using a bulk loader.

http://dev.mysql.com/doc/refman/5.0/en/load-data.html
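
Invoking it from JDBC could look roughly like this (the CSV path, table and column layout are placeholders, and LOCAL has to be allowed by both the server and the driver):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of calling the bulk loader over JDBC. Write the CSV with the same
// field/line terminators; the path and table name here are made up.
public class BulkLoad {
    public static void load(Connection con) throws SQLException {
        Statement stmt = con.createStatement();
        try {
            stmt.execute(
                "LOAD DATA LOCAL INFILE '/tmp/parsed_log_rows.csv' " +
                "INTO TABLE the_table " +
                "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' " +
                "LINES TERMINATED BY '\\n' " +
                "(col1, col2)");
        } finally {
            stmt.close();
        }
    }
}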

pjp
-1 a bulk loader is useful only if you have static data
dfa
@dfa - what do you mean by static data?
pjp
I think it's a fair enough suggestion. Writing CSV files and using LOAD DATA INFILE can be really, really fast in my experience. It's a bit more complex since it involves writing the CSV files and making sure MySQL can find them, though.
Theo
Using a similar utility in SQL Server (bcp) I managed to reduce my insert time from around 50 seconds to around 3 seconds for around 25k rows.
pjp
+1  A: 

Regarding the difference between extended inserts and batching single inserts, the reason I decided to use extended inserts is because I noticed that it took my code a lot longer to insert a lot of rows than mysql does from the terminal. This was even though I was batching inserts in batches of 5000. The solution in the end was to use extended inserts.

I quickly retested this theory.

I took two dumps of a table with 1.2 million rows. One using the default extended insert statements you get with mysqldump and the other using:

mysqldump --skip-extended-insert

Then I simply imported the files again into new tables and timed it.

The extended insert test finished in 1m35s and the other in 3m49s.

Daniel
That doesn't take prepared statements into account, though. To make a fair comparison you'd need to load the data through an application that first prepares the INSERT statement, adds each row to a batch and runs it. I still think you're right in the end, but this doesn't actually say whether prepared statements + batching would be as fast as extended inserts.
Theo