What is the fastest way to insert 237 million records into a table that has rules (for distributing the data across 84 child tables)?

First I tried plain INSERTs. No go. Then I tried INSERTs wrapped in BEGIN/COMMIT. Not nearly fast enough. Next I tried COPY FROM, but then noticed that the documentation says rules are ignored by COPY. (It was also having difficulties with the column order and date format -- it said that '1984-07-1' was not a valid integer; true, but a bit unexpected.)

Some example data:

station_id,taken,amount,category_id,flag
1,'1984-07-1',0,4,
1,'1984-07-2',0,4,
1,'1984-07-3',0,4,
1,'1984-07-4',0,4,T

Here is the table structure (with one rule included):

CREATE TABLE climate.measurement
(
  id bigserial NOT NULL,
  station_id integer NOT NULL,
  taken date NOT NULL,
  amount numeric(8,2) NOT NULL,
  category_id smallint NOT NULL,
  flag character varying(1) NOT NULL DEFAULT ' '::character varying
)
WITH (
  OIDS=FALSE
);
ALTER TABLE climate.measurement OWNER TO postgres;

CREATE OR REPLACE RULE i_measurement_01_001 AS
    ON INSERT TO climate.measurement
    WHERE date_part('month'::text, new.taken)::integer = 1 AND new.category_id = 1
    DO INSTEAD
        INSERT INTO climate.measurement_01_001 (id, station_id, taken, amount, category_id, flag)
        VALUES (new.id, new.station_id, new.taken, new.amount, new.category_id, new.flag);

I can generate the data into any format.

I'm looking for something that won't take four days.

I originally had the data in MySQL (still do), but am hoping to get a performance increase by switching to PostgreSQL and am eager to use its PL/R extensions for stats.

I was also thinking about using pg_bulkload: http://pgbulkload.projects.postgresql.org/

Any help, tips, or guidance would be greatly appreciated.

Thank you!

+2  A: 

Split your input into separate files outside the database and upload each one using COPY, rather than relying on the rule to distribute them. If the rule you give is any example, that's a trivial text transformation to apply. Also, splitting up front will let you load the split files in parallel if your disk system is up to it.
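
For example, the load side could then look something like this (the file names, paths, and the second child-table name are made up to show the pattern, and it assumes the split files have no header line and no quoting around the dates, per the format clean-up mentioned below):

COPY climate.measurement_01_001 (station_id, taken, amount, category_id, flag)
    FROM '/tmp/split/measurement_01_001.csv' WITH CSV;
COPY climate.measurement_01_002 (station_id, taken, amount, category_id, flag)
    FROM '/tmp/split/measurement_01_002.csv' WITH CSV;
-- ... one COPY per child table; the rules on the parent never fire, "id" still
-- comes from the sequence, and separate sessions can run these in parallel.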

Seriously, don't rely on the rule to do this distribution for a bulk load. It's practically always the case that bulk load and transactional load need different approaches, unless you're prepared to brute-force one or the other (and, usually, wait).

For instance, your rule uses date_part() to extract the month from the date, so in order to determine the child table, Postgres needs to analyse the date string, convert it to a timestamp, and then pull the month field back out of that timestamp again. But if you're writing something to do this up front, you can just do substr($date,5,2) (or equivalent): which do you think will be faster?
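
Just to illustrate that both routes pull out the same value (SQL's substr() is 1-based, so it's positions 6-7 here, whereas the substr($date,5,2) above is the 0-based scripting version):

SELECT substr('1984-07-01', 6, 2)            AS month_from_text,   -- '07'
       date_part('month', DATE '1984-07-01') AS month_from_date;   -- 7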

It's also an opportunity to clean up the data format so COPY will accept it. Note that you can specify the columns with the COPY command: if you don't do that with this schema and example file, you'll get errors due to the extra "id" column at the front. ("copy from ... with csv header" won't figure that out for you -- the "header" option only makes COPY skip the first line of the file; it doesn't match columns up by name.)
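
If you'd rather leave the files exactly as in the sample above (header line, single quotes around the dates) instead of cleaning them up, you can tell COPY about that explicitly -- treating the single quote as the CSV quote character is an assumption about how the rest of your data looks:

COPY climate.measurement_01_001 (station_id, taken, amount, category_id, flag)
    FROM '/tmp/split/measurement_01_001.csv'
    WITH CSV HEADER QUOTE '''';
-- HEADER only skips the first line; the explicit column list is what stops the
-- missing "id" column from causing errors.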

I've just loaded about 280e6 rows into a PostgreSQL instance myself in a few hours, so it's certainly not impossible. For this initial load I've turned fsync=off; the plan is to load the backlog and then turn it back on again for the regular daily loads. I had to set checkpoint_segments=40 to avoid getting checkpoint warnings in the logs. This is just being loaded onto my dev machine, using a dedicated disk for the database, separate from the disk used for the xlogs (i.e. I created a tablespace on the big disk and created the database inside that tablespace). The instance has shared_buffers set to 1GB and checkpoint_completion_target set to 0.5. I tried loading some of the partitions in parallel and it didn't give much improvement, so I suspect the slow disk is the bottleneck rather than the DB itself.
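
Roughly, that setup looks like this (the tablespace name, path, and database name are illustrative, and fsync=off is only sensible while you can afford to reload everything from scratch after a crash):

-- postgresql.conf (fsync and shared_buffers need a server restart to change):
--   fsync = off                          -- only while loading the backlog
--   checkpoint_segments = 40
--   shared_buffers = 1GB
--   checkpoint_completion_target = 0.5

-- Keep the data files on a different disk from the one holding the xlogs:
CREATE TABLESPACE bigdisk LOCATION '/mnt/bigdisk/pgdata';
CREATE DATABASE climate_db OWNER postgres TABLESPACE bigdisk;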

Just another 1.7e9 rows to go... should be finished tomorrow sometime I hope.

araqnid
Quite appreciated, thank you.
Dave Jarvis
A: 

The PostgreSQL documentation contains a page on populating a database, which might help once you've followed araqnid's advice to pre-process the input so that you can use COPY.
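
A loose sketch of a few of that page's suggestions, per child table (the index name here is hypothetical -- drop and re-create whatever indexes the child tables actually have, and raise maintenance_work_mem so the rebuilds go faster):

SET maintenance_work_mem = '512MB';
DROP INDEX IF EXISTS climate.idx_measurement_01_001_taken;   -- hypothetical index
-- ... bulk COPY into climate.measurement_01_001 here ...
CREATE INDEX idx_measurement_01_001_taken
    ON climate.measurement_01_001 (taken);
ANALYZE climate.measurement_01_001;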

Stephen Denne