views: 1137
answers: 4
A few days ago, I ran into an unexpected performance problem with a pretty standard Django setup. For an upcoming feature, we have to regenerate a table hourly, containing about 100k rows of data, 9 MB of data and 10 MB of indexes on disk according to pgAdmin.

The problem is that inserting them, by whatever method, literally takes ages: up to 3 minutes of 100% disk busy time. That's not something you want on a production site. It makes no difference whether the inserts are wrapped in a transaction or issued via plain INSERT, multi-row INSERT, COPY FROM, or even INSERT INTO t1 SELECT * FROM t2.

After confirming this isn't Django's fault, I followed a trial-and-error route, and hey, the problem disappeared after dropping all foreign keys! Instead of 3 minutes, the INSERT INTO ... SELECT FROM took less than a second to execute, which isn't too surprising for a table <= 20 MB on disk. What is weird is that PostgreSQL manages to slow down inserts by 180x just by using 3 foreign keys.

Oh, and the disk activity was pure writing, as everything is cached in RAM; only writes go to the disks. It looks like PostgreSQL is working very hard to touch every row in the referenced tables, as 3 MB/s * 180 s is way more data than the 20 MB this new table takes on disk. The 180 s figure is without extra WAL overhead, since I was testing in psql directly; in Django, add ~50% overhead for WAL logging. I tried @commit_on_success, same slowness, and I had even implemented multi-row INSERT and COPY FROM with psycopg2. That's another weird thing: how can 10 MB worth of inserts generate more than ten 16 MB log segments?
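
For reference, a minimal sketch of the psycopg2 COPY FROM variant (simplified; the table, column and helper names here are illustrative, not the real schema):

    import io
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()

    # Build the rows as tab-separated text in memory and stream them with COPY.
    buf = io.StringIO()
    for row in generate_rows():  # hypothetical generator yielding tuples
        buf.write("\t".join(str(v) for v in row) + "\n")
    buf.seek(0)

    cur.copy_from(buf, "stats_hourly",
                  columns=("small_id", "big1_id", "big2_id", "value"))
    conn.commit()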

Table layout: id serial primary key, a bunch of int32 columns, and 3 foreign keys to

  • a small table, 198 rows, 16 kB on disk
  • a large table, 1.2M rows, 59 MB data + 89 MB index on disk
  • a large table, 2.2M rows, 198 MB data + 210 MB index on disk

So, am I doomed to either drop the foreign keys manually or use the table in a very un-Django way by saving bla_id fields directly (x3) and skipping models.ForeignKey? I'd love to hear about some magical antidote / pg setting to fix this.
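
For the record, the un-Django workaround would look roughly like this (a sketch with made-up model and field names):

    from django.db import models

    class HourlyStat(models.Model):
        # Hypothetical workaround: store the raw ids instead of models.ForeignKey,
        # so no FK constraints are created and bulk inserts skip the FK checks.
        small_id = models.IntegerField(db_index=True)
        big1_id = models.IntegerField(db_index=True)
        big2_id = models.IntegerField(db_index=True)
        value = models.IntegerField()

The obvious downside is that the database no longer enforces referential integrity.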

A: 

This seems like normal behavior to me. When bulk inserting into a database, if the table has indexes, foreign keys or triggers, they have to be checked row by row. So typically you want to drop them, perform the inserts (using COPY if possible), and then recreate the indexes, FKs and triggers.

This page in the docs has more details about autocommit, maintenance_work_mem and checkpoint_segments that you can tune: http://www.postgresql.org/docs/8.4/interactive/populate.html
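
As a rough sketch of that drop-and-recreate pattern (constraint, table and column names here are placeholders, not your actual schema):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()

    # maintenance_work_mem is used by ALTER TABLE ADD FOREIGN KEY and CREATE INDEX,
    # so raising it for this session speeds up the re-validation step.
    cur.execute("SET maintenance_work_mem = '256MB'")

    # Drop the FK constraints, bulk load, then put the constraints back.
    # ADD CONSTRAINT validates all rows in one pass instead of row by row.
    cur.execute("ALTER TABLE stats_hourly DROP CONSTRAINT stats_hourly_small_id_fkey")
    cur.execute("ALTER TABLE stats_hourly DROP CONSTRAINT stats_hourly_big1_id_fkey")
    # (repeat for the third FK)

    cur.execute("INSERT INTO stats_hourly SELECT * FROM stats_hourly_staging")

    cur.execute("ALTER TABLE stats_hourly ADD CONSTRAINT stats_hourly_small_id_fkey "
                "FOREIGN KEY (small_id) REFERENCES small_table (id)")
    cur.execute("ALTER TABLE stats_hourly ADD CONSTRAINT stats_hourly_big1_id_fkey "
                "FOREIGN KEY (big1_id) REFERENCES big_table1 (id)")
    conn.commit()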

hgimenez
Already seen it. The problem isn't the checking, that's just an O(n log n) index lookup on two indices, reading them from memory. No, the problem is the massive amount of writes issued to the disk. Dropping indices didn't cause a measurable increase in performance, and autocommit - I've tried with and without, same result, as stated in the original post.
Miles
I have yet to see a database where dropping FKs does *not* improve performance of bulk inserts. You just need to be smart about dropping and recreating them. It is obviously a great idea to have them.
hgimenez
A: 

Maybe you have a trigger on your table that you don't know of or don't remember, which fires on every row inserted/deleted. Can you connect to the database using "psql"? If yes, then analyze the output of "\d+ table_name" for all your tables.
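
If psql isn't handy, a quick way to list user-defined triggers from Python (a sketch; connection details assumed):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()

    # information_schema.triggers only lists user-defined triggers,
    # which is exactly what we're looking for here.
    cur.execute("""
        SELECT event_object_table, trigger_name, event_manipulation, action_statement
        FROM information_schema.triggers
        ORDER BY event_object_table
    """)
    for table, name, event, action in cur.fetchall():
        print(table, name, event, action)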

You can also dump your database, do the import, then dump the database again and compare the dumps to check whether any other table's contents have changed.

Tometzky
psql / pgAdmin shows pretty much the same thing - plain table, 3 FKs, 1 PK, 3 indices. Only dropping the FKs "fixes" the insert performance. It's worrying, as Django promotes using FKs to ensure consistency, so there are FKs all over the place, but postgres doesn't handle them too well. I guess I'll try asking the postgres developers too.
Miles
+1  A: 

100,000 FK checks should take about 2-5 seconds if they don't have to wait for IO reads. That's much slower than inserting into the table itself, but much faster than the time you got.

Check that all your foreign keys are INDEXED:

(I'm talking about an index on the referenced column, not the referencing column, got it?)

If products.category_id REFERENCES category(id), and there is no index on category.id, every time it needs to check a FK it will have to scan the whole table.
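
One quick way to check which indexes exist on a referenced table from Python, using the hypothetical products/category example above (pg_indexes is a standard system view):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()

    # List all indexes on the referenced table; there should be one covering "id"
    # (a PRIMARY KEY or UNIQUE constraint creates it automatically).
    cur.execute("SELECT indexname, indexdef FROM pg_indexes WHERE tablename = %s",
                ("category",))
    for name, definition in cur.fetchall():
        print(name, definition)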

To find which one isn't, do your insert with 1 FK, then 2 FKs, and so on; you'll find which one is responsible.

And yes, if you truncate the table, it's faster to also drop all constraints and indexes and rebuild them after bulk insertion.

peufeu
I reference only the primary id of the other tables - so yes, they are indexed. And no, it's not read IO: somehow, creating a 9 MB data table takes way longer than expected.
Miles
A: 

I had forgotten that EXPLAIN ANALYZE INSERT INTO bleh ... will show you the timing of all insert triggers.
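
For example (a sketch with made-up table and values; note that EXPLAIN ANALYZE actually executes the INSERT, hence the rollback):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    cur = conn.cursor()

    cur.execute("EXPLAIN ANALYZE INSERT INTO stats_hourly (small_id, big1_id, big2_id, value) "
                "VALUES (1, 2, 3, 42)")
    for (line,) in cur.fetchall():
        print(line)       # any trigger timing lines are part of this output
    conn.rollback()       # EXPLAIN ANALYZE really ran the INSERT, so undo it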

peufeu
Not for FKs, or are we talking about different versions? For a simple 1-row insert it gives: Result (cost=0.00..0.02 rows=1 width=0) (actual time=0.007..0.009 rows=1 loops=1), Total runtime: 32.171 ms.
Miles