ansaurus

Question

Bulk Insert of hundreds of millions of records

Answer 1

+2 A:

Split your input into separate files outside the database and upload each one using COPY, rather than relying on the rule to distribute them. If the rule you give is any example, that's a trivial text transformation to apply. Also, splitting up front will let you load the split files in parallel if your disk system is up to it.

Seriously, don't rely on the rule to do this distribution for a bulk load. It's practically always the case that bulk load and transactional load need different approaches, unless you're prepared to brute-force one or the other (and, usually, wait).

For instance, your rule uses date_part() to extract the month from the date- so in order to determine the child table, postgres needs to analyse the date string, convert it to a timestamp, and then convert the timestamp back to a calendar just to get the month field back out again. But if you're writing something to do this upfront, you can just do substr($date,5,2) (or equivalent): which do you think will be faster?

It's also an opportunity to clean up the data format so COPY will accept it. Note you can specify the columns with the COPY command: if you weren't doing that with that schema and example file, you'd get errors due to the extra "id" column on the front. ("copy from ... with csv header" may have figured that out, but maybe not... the "header" option may just make it skip the first line).

I've just loaded about 280e6 rows into a postgresql instance myself in a few hours so it's certainly not impossible. For this initial load, I've turned fsync=off; the plan is to load the backlog and then turn it back on again for regular daily loads. I had to set checkpoint_segments=40 to avoid getting checkpoint warnings in the logs. This is just being loaded onto my dev machine- I'm using a dedicated disk for the database, which is different from the disk used for xlogs (i.e. I created a tablespace on the big disk and created the database inside that tablespace). The instance has shared_buffers set to 1Gb, and checkpoint_target set to 0.5. I tried loading some of the partitions in parallel and it didn't provide much improvement, so I suspect the slow disk is being the bottleneck rather than the DB itself.

Just another 1.7e9 rows to go... should be finished tomorrow sometime I hope.

araqnid 2010-05-16 13:07:08

Quite appreciated, thank you.

Dave Jarvis 2010-05-16 17:23:03

Answer 2

A:

PostgreSQL documentation contains a page on populating a database, which might help you once you've followed araqnid's advice to pre-process the input so you can use COPY.

Stephen Denne 2010-05-17 03:57:51

ansaurus

tags:

views:

answers:

Bulk Insert of hundreds of millions of records

related questions