views:

1889

answers:

5

Hi again,

I am currently analyzing a Wikipedia dump file; I am extracting a bunch of data from it using Python and persisting it into a PostgreSQL db. I am always trying to make things go faster, because this file is huge (18 GB). To interface with PostgreSQL I am using psycopg2, but this module seems to follow the same DB-API conventions as many others.

Anyway, I have a question concerning cursor.executemany(command, values): it seems to me that calling executemany() once for every 1000 values or so is better than calling cursor.execute(command % value) for each of these 5 million values (please confirm or correct me!).
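
To make the comparison concrete, here is a minimal sketch of the two approaches (the pages table, its columns, the connection string and the sample rows are made up for illustration; the real script streams rows out of the dump instead of using a hard-coded list):

    import psycopg2

    # Hypothetical sample standing in for the ~5 million parsed rows.
    rows = [(1, "Anarchism"), (2, "Autism"), (3, "Albedo")]

    conn = psycopg2.connect("dbname=wikidump")   # assumed connection string
    cur = conn.cursor()
    # %s placeholders are psycopg2 parameters, so values are quoted safely
    # (unlike building the statement yourself with command % value).
    sql = "INSERT INTO pages (page_id, title) VALUES (%s, %s)"

    # Only one of the two approaches would be used; both are shown to compare.

    # Approach 1: one execute() call per row, i.e. ~5 million calls.
    for row in rows:
        cur.execute(sql, row)

    # Approach 2: one executemany() call per batch of up to 1000 rows.
    BATCH = 1000
    for i in range(0, len(rows), BATCH):
        cur.executemany(sql, rows[i:i + BATCH])

    conn.commit()
    cur.close()
    conn.close()

(As far as I can tell, psycopg2 implements executemany() as a loop over execute() under the hood, so the speed-up over calling execute() yourself in a loop may be smaller than expected.)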

But, you see, I am using executemany() to INSERT 1000 rows into a table which has a UNIQUE integrity constraint; this constraint is not verified in Python beforehand, because that would either require me to SELECT all the time (which seems counterproductive) or require more than 3 GB of RAM. All this to say that I count on Postgres to warn me when my script tries to INSERT an already existing row, by catching psycopg2.DatabaseError.

When my script detects such a non-UNIQUE INSERT, it calls connection.rollback() (which throws away up to 1000 rows every time, and kind of makes the executemany worthless) and then INSERTs all values one by one.
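
For reference, the pattern I am describing looks roughly like this (a simplified sketch, not my actual code; the table, the columns and the insert_batch name are hypothetical):

    import psycopg2

    def insert_batch(conn, batch):
        """Insert a batch of (page_id, title) tuples, falling back to
        row-by-row INSERTs when the batch hits a UNIQUE violation."""
        sql = "INSERT INTO pages (page_id, title) VALUES (%s, %s)"
        cur = conn.cursor()
        try:
            cur.executemany(sql, batch)
            conn.commit()
        except psycopg2.DatabaseError:
            # One duplicate aborts the whole statement, so undo the batch...
            conn.rollback()
            # ...then retry one row at a time so only the duplicates are lost.
            for row in batch:
                try:
                    cur.execute(sql, row)
                    conn.commit()
                except psycopg2.DatabaseError:
                    conn.rollback()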

Since psycopg2 is so poorly documented (as are so many great modules...), I cannot find an efficient and effective workaround. I have reduced the number of values INSERTed per executemany from 1000 to 100 in order to reduce the likelihood of a non-UNIQUE INSERT per executemany, but I am pretty certain there is a way to just tell psycopg2 to ignore these exceptions, or to tell the cursor to continue the executemany.

Basically, this seems like the kind of problem with a solution so easy and popular that all I can do is ask in order to learn about it.

Thanks again!

A: 

Could this be a solution to your problem:

  1. Remove the unique index from the table you are inserting into.
  2. Insert all the data (this should be faster since there is no index to maintain).
  3. Find the duplicates with something like "select id, count(1) as c from table group by id having count(1) > 1".
  4. Delete the duplicates, or do whatever you need to make them unique.
  5. Add the unique index back (and make sure it isn't rejected) -- see the sketch below.
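
Here is what that could look like with psycopg2 (all names are hypothetical; steps 3 and 4 are collapsed into the common ctid dedup trick, and if the constraint is a table constraint rather than a plain unique index you would need ALTER TABLE ... DROP CONSTRAINT instead):

    import psycopg2

    conn = psycopg2.connect("dbname=wikidump")   # assumed DSN
    cur = conn.cursor()

    # 1. Drop the unique index (the index name is hypothetical).
    cur.execute("DROP INDEX IF EXISTS pages_page_id_key")

    # 2. Bulk-insert everything, duplicates included (loading code not shown).

    # 3./4. Keep one physical row per key and delete the rest; comparing
    #       ctid is a common way to pick a survivor without a surrogate key.
    cur.execute("""
        DELETE FROM pages a
        USING pages b
        WHERE a.page_id = b.page_id
          AND a.ctid > b.ctid
    """)

    # 5. Recreate the unique index; this will fail if duplicates remain.
    cur.execute("CREATE UNIQUE INDEX pages_page_id_key ON pages (page_id)")
    conn.commit()
    cur.close()
    conn.close()
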
Hi. Actually, this is kind of what I did beforehand. While it works and indeed increases execution efficiency, the gain is overshadowed by the time wasted afterward on SELECT, DELETE, CREATE INDEX, VACUUM, etc. Anyway, I am still hoping for an online solution (one that works while the script executes). But thank you!
A: 

"When my script detects such a non-UNIQUE INSERT, it connection.rollback() (which makes ups to 1000 rows everytime, and kind of makes the executemany worthless) and then INSERTs all values one by one."

The question doesn't really make a lot of sense.

Does EVERY block of 1,000 rows fail due to non-unique rows?

Does 1 block of 1,000 rows fail (out of 5,000 such blocks)? If so, then executemany helps for 4,999 out of 5,000 and is far from "worthless".

Are you worried about this non-UNIQUE INSERT? Or do you have actual statistics on the number of times this happens?

If you've switched from 1,000 row blocks to 100 row blocks, you can -- obviously -- determine if there's a performance advantage for 1,000 row blocks, 100 row blocks and 1 row blocks.

Please actually run the actual program with the actual database and different block sizes and post the numbers.
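
For what it's worth, the timing harness for that can be tiny; something like the sketch below, where insert_batch and rows stand in for the actual loading code and data:

    import time

    def benchmark(insert_batch, rows, batch_size):
        """Time inserting `rows` in chunks of `batch_size` rows."""
        start = time.time()
        for i in range(0, len(rows), batch_size):
            insert_batch(rows[i:i + batch_size])
        return time.time() - start

    # For example:
    #   for size in (1, 100, 1000):
    #       print(size, benchmark(insert_batch, rows, size))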

S.Lott
A: 

S.Lott: here is what I meant to say: "When my script detects such a non-UNIQUE INSERT, it calls connection.rollback() (which makes up to 1000 dead tuples, and kind of makes the executemany worthless) and then retries by INSERTing all values one by one using execute() in order to catch the failed INSERT(s)"

As for the number of times a block of 1000 INSERTs fails, that is too much detail for me to track. Actually, I had already considered this, which is why I now executemany 100 INSERTs instead of 1000.

I must admit that a benchmark would be interesting; it would indeed tell us about efficiency; but then I would have to set up this environment... which kind of defeats the point of asking you guys (in the hope of getting an easy answer).

If I base my reasoning on the input of the others throughout this discussion, it seems to me that executemany is as much a myth as HASH indexes...

Thank you for your help

Please revise your question. This is not an answer.
S.Lott
I can't, for I did not have an account when I wrote the question! And I do not have enough reputation to edit!! Thanks for the heads up though.
Nicholas Leonard
+2  A: 

Just copy all the data into a scratch table with the psql \copy command, or use the psycopg cursor.copy_in() method. Then:

insert into mytable select * from (select distinct * from scratch) uniq where not exists (select 1 from mytable where mytable.mykey = uniq.mykey);

This will dedupe, and it runs much faster than any combination of INSERTs.
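
In psycopg2 terms that could look roughly like the following sketch (table, column and file names are made up; psycopg2 exposes COPY as cursor.copy_from()/copy_expert()):

    import psycopg2

    conn = psycopg2.connect("dbname=wikidump")   # assumed DSN
    cur = conn.cursor()

    # A scratch table shaped like the target table.
    cur.execute("CREATE TEMP TABLE scratch (mykey integer, title text)")

    # Bulk-load tab-separated rows straight into the scratch table.
    with open("rows.tsv") as f:              # hypothetical dump of parsed rows
        cur.copy_from(f, "scratch", columns=("mykey", "title"))

    # Dedup and merge into the real table in one statement.
    cur.execute("""
        INSERT INTO mytable
        SELECT * FROM (SELECT DISTINCT * FROM scratch) uniq
        WHERE NOT EXISTS
              (SELECT 1 FROM mytable WHERE mytable.mykey = uniq.mykey)
    """)
    conn.commit()
    cur.close()
    conn.close()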

-dg

A: 

Using a MERGE statement instead of an INSERT would solve your problem.

PostgreSQL does not support the MERGE statement at all, as of version 8.5 beta 2.
intgr