views:

694

answers:

4

I am writing a program that does a lot of writes to a Postgres database. In a typical scenario I would be writing say 100,000 rows to a table that's well normalized (three foreign integer keys, the combination of which is the primary key and the index of the table). I am using PreparedStatements and executeBatch(), yet I can only manage to push in say 100k rows in about 70 seconds on my laptop, when the embedded database we're replacing (which has the same foreign key constraints and indices) does it in 10.

I am new at JDBC and I don't expect it to beat a custom embedded DB, but I was hoping it to be only 2-3x slower, not 7x. Anything obvious that I maybe missing? does the order of the writes matter? (i.e. say if it's not the order of the index?). Things to look at to squeeze out a bit more speed?

+1  A: 

You can obviously try to change the size of your batch to find the best size for your configuration, but I doubt that you will gain a factor 3.

You could also try to tune your database structure. You might have better performances when using a single field as a primary key than using a composed PK. Depending on the level of integrity you need, you might save quite some time by deactivating integrity checks on your DB.

You might also change the database you are using. MySQL is supposed to be pretty good for high speed simple inserts ... and I know there is a fork of MySQL around that tries to cut functionalities to get very high performances on highly concurrent access.

Good luck !

Guillaume
+1  A: 

try disabling indexes, and reenabling them after the insert. also, wrap the whole process in a transaction

Javier
+5  A: 

This is an issue that I have had to deal with often on my current project. For our application, insert speed is a critical bottleneck. However, we have discovered for the vast majority of database users, the select speed as their chief bottleneck so you will find that there are more resources dealing with that issue.

So here are a few solutions that we have come up with:

First, all solutions involve using the postgres COPY command. Using COPY to import data into postgres is by far the quickest method available. However, the JDBC driver by default does not currently support COPY accross the network socket. So, if you want to use it you will need to do one of two workarounds:

  1. A JDBC driver patched to support COPY, such as this one.
  2. If the data you are inserting and the database are on the same physical machine, you can write the data out to a file on the filesystem and then use the COPY command to import the data in bulk.

Other options for increasing speed are using JNI to hit the postgres api so you can talk over the unix socket, removing indexes and the pg_bulkload project. However, in the end if you don't implement COPY you will always find performance disappointing.

Elijah
Thanks for the tips; by 'using JNI' do you mean using JNI to access COPY or for issuing regular SQL commands? i.e. do you expect JNI->C->SQL to be faster than JDBC for the same number of INSERTs?
Max Maximus
I have not benchmarked it in postgres, but I believe that it is the strategy that the oracle driver takes. There is a performance overhead when going through tcp versus the unix socket. So in the end, a custom solution for performance may not be worth the effort, so I look at it as a last resort.
Elijah
+1  A: 

Check if your connection is set to autoCommit. If autoCommit is true, then if you have 100 items in the batch when you call executeBatch, it will issue 100 individual commits. That can be a lot slower than calling executingBatch() followed by a single explicit commit().

I would avoid the temptation to drop indexes or foreign keys during the insert. It puts the table in an unusable state while your load is running, since nobody can query the table while the indexes are gone. Plus, it seems harmless enough, but what do you do when you try to re-enable the constraint and it fails because something you didn't expect to happen has happened? An RDBMS has integrity constraints for a reason, and disabling them even "for a little while" is dangerous.