Hello,

I have a database table that is growing too big (a few hundred million rows) and needs to be optimized, but before I get into partitioning it, I thought I'd ask for suggestions.

Here is the usage:

  1. The table contains about 10 columns, each roughly 20 bytes long.

  2. INSERTs are performed at a rate of hundreds per second.

  3. SELECTs are performed based on column 'a' (WHERE a = 'xxxx') a few times per hour.

  4. DELETEs are performed based on a DATE column (delete rows whose date is older than 1 year), usually once per day.

The key requirement is to speed up the INSERT and SELECT statements, and to be able to keep one year of history without locking the whole table down while deleting.

I would guess that I need two indexes: one on column 'a', and the other on the date field. Or is it possible to optimize both?
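
To make this concrete (table and column names here are just placeholders, not my real schema), I mean something like:

CREATE INDEX big_table_a_idx    ON big_table (a);     -- serves SELECT ... WHERE a = 'xxxx'
CREATE INDEX big_table_date_idx ON big_table (mdate); -- serves DELETE ... WHERE mdate < ...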

Will there be an unavoidable trade-off between SELECT speed and DELETE speed?

Is partitioning the only solution? What are good strategies for partitioning such a table?

I'm using a PostgreSQL 8.4 database.

Best Regards, Roger Tannous.

+2  A: 

Rather than keeping it as a single physical table, have you looked into PostgreSQL partitioning? It has been supported since version 8.1.

Partitioning can help you avoid having to choose between fast INSERT and fast DELETE performance. You can partition the table by year/month and just drop the partitions you no longer need. Dropping a partition is extremely fast, and inserting into small partitions is also extremely fast.

From the manual:

Partitioning refers to splitting what is logically one large table into smaller physical pieces. Partitioning can provide several benefits:

  • Query performance can be improved dramatically for certain kinds of queries.
  • Update performance can be improved too, since each piece of the table has indexes smaller than an index on the entire data set would be. When an index no longer fits easily in memory, both read and write operations on the index take progressively more disk accesses.
  • Bulk deletes may be accomplished by simply removing one of the partitions, if that requirement is planned into the partitioning design. DROP TABLE is far faster than a bulk DELETE, to say nothing of the ensuing VACUUM overhead.
  • Seldom-used data can be migrated to cheaper and slower storage media.

The benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.

Currently, PostgreSQL supports partitioning via table inheritance. Each partition must be created as a child table of a single parent table. The parent table itself is normally empty; it exists just to represent the entire data set. You should be familiar with inheritance (see Section 5.8) before attempting to implement partitioning.
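
For illustration, a minimal month-based layout under this inheritance scheme might look like the following (table, column, and partition names are invented for the sketch):

CREATE TABLE big_table (a text, mdate date);  -- plus the other ~8 columns

-- one child table per month, each with a CHECK constraint the planner can use
CREATE TABLE big_table_2009_06 (
    CHECK (mdate >= DATE '2009-06-01' AND mdate < DATE '2009-07-01')
) INHERITS (big_table);

CREATE TABLE big_table_2009_07 (
    CHECK (mdate >= DATE '2009-07-01' AND mdate < DATE '2009-08-01')
) INHERITS (big_table);

-- expiring a month that has fallen outside the 1-year window is nearly instant
DROP TABLE big_table_2008_06;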

LBushkin
But partitioning the table by year/month optimizes for the delete rather than the insert/select, right?
Pat
@Pat: It should optimize both. Having small partitions improves inserts, and if most queries only need "current" data, those are optimized as well. As for SELECTs that operate against the entire table (all partitions), they already incur such a high cost that the overhead of physical partitioning should be negligible.
LBushkin
So if I understand correctly, the partitioning you suggest will optimize the delete and the insert, and it will not make the select worse because the select is already bad. But the requirement is to speed up "inserts and selects", not "deletes and inserts". Or am I missing something in your answer?
Pat
A: 

One solution is to partition based on insert date.

Namely, your application (or DAO) decides which table to insert into, based on some logic that combines the current date (or rather the time since the last partition slice was started) and/or the size of the "last" partition. Or offload that logic into a daily script, and have the script record "this is the partition to use" somewhere for the DAO's use.

This immediately removes your need to delete "old" rows (just drop the old partition); it also ensures that your inserts periodically start populating a small table, which, among other things, speeds up the "average" INSERT/SELECT speed (worst-case scenarios are still just as slow, of course).
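
As a rough sketch (all names here are invented), the daily script could do something like:

-- create tomorrow's slice ahead of time
CREATE TABLE big_table_2009_07_02 () INHERITS (big_table);

-- record it in a one-row pointer table that the DAO consults before each INSERT
UPDATE partition_pointer SET current_partition = 'big_table_2009_07_02';

-- and retire the slice that just fell outside the 1-year window
DROP TABLE big_table_2008_07_02;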

DVK
That's great for the delete, but according to the question "The key requirement is to speed up INSERT and SELECT statements", wouldn't partitioning on column a be better in this case?
Pat
A: 

If you were to break this table up with proper partitioning, you would be able to use TRUNCATE instead of DELETE, which would decrease your maintenance costs because it does not create dead space.
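
For example (partition name invented):

-- drops all rows in one old partition instantly, with no dead tuples left behind
TRUNCATE big_table_2008_06;

-- versus a bulk DELETE, which leaves dead space for VACUUM to deal with:
-- DELETE FROM big_table WHERE mdate < DATE '2008-07-01';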

Joshua D. Drake
That's great for the delete, but according to the question "The key requirement is to speed up INSERT and SELECT statements", wouldn't partitioning on column a be better in this case?
Pat
Cleaning up the delete process would also help the SELECTs (not the INSERTs), because it would reduce random reads caused by bloat. Yes, proper partitioning is definitely the overall solution.
Joshua D. Drake
A: 

I am no expert, but it seems that partitioning on column "a" would speed up your selects, while partitioning on the date (as all the other answers suggest) would speed up the deletes (drop the partition) but would be useless for your selects.

It seems both cases would improve insert performance.

Would any expert care to weigh in on the issue? Is it possible / useful to partition on both fields?

Pat
A: 

Partitioning is your answer, as others stated, but:

I'd partition on some hash(a). If a is an integer, then a % 256 would be good. If it is text, then something like substring(md5(a) for 2).
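
For the integer case, that function could be as simple as this sketch (the text variant is defined in the EDIT below):

create function hash(a integer) returns integer as $$
  select $1 % 256  -- 256 buckets
$$ language sql immutable strict;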

It will speed up inserts and selects.

For deletes I'd run them more often but in smaller, likewise partitioned chunks. I'd run one every hour (at XX:30), like this:

delete from table_name
where date < (current_date - interval '1 year')
  -- each hourly run clears exactly one of the 256 hash buckets,
  -- cycling through all of them as the hours pass
  and hash(a) = (extract(doy from current_timestamp) * 24
                 + extract(hour from current_timestamp))::int % 256;

EDIT: I've just tested this:

create function hash(a text) returns text as $$
  select substring(md5($1) for 1)
$$ language sql immutable strict;
CREATE TABLE tablename (id text, mdate date);
CREATE TABLE tablename_partition_0 ( CHECK ( hash(id) = '0' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_1 ( CHECK ( hash(id) = '1' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_2 ( CHECK ( hash(id) = '2' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_3 ( CHECK ( hash(id) = '3' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_4 ( CHECK ( hash(id) = '4' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_5 ( CHECK ( hash(id) = '5' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_6 ( CHECK ( hash(id) = '6' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_7 ( CHECK ( hash(id) = '7' ) ) INHERITS (tablename); 
CREATE TABLE tablename_partition_8 ( CHECK ( hash(id) = '8' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_9 ( CHECK ( hash(id) = '9' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_a ( CHECK ( hash(id) = 'a' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_b ( CHECK ( hash(id) = 'b' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_c ( CHECK ( hash(id) = 'c' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_d ( CHECK ( hash(id) = 'd' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_e ( CHECK ( hash(id) = 'e' ) ) INHERITS (tablename);
CREATE TABLE tablename_partition_f ( CHECK ( hash(id) = 'f' ) ) INHERITS (tablename);
analyze;
explain select * from tablename where id='bar' and hash(id)=hash('bar');
                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 Result  (cost=0.00..69.20 rows=2 width=36)
   ->  Append  (cost=0.00..69.20 rows=2 width=36)
         ->  Seq Scan on tablename  (cost=0.00..34.60 rows=1 width=36)
               Filter: ((id = 'bar'::text) AND ("substring"(md5(id), 1, 1) = '3'::text))
         ->  Seq Scan on tablename_partition_3 tablename  (cost=0.00..34.60 rows=1 width=36)
               Filter: ((id = 'bar'::text) AND ("substring"(md5(id), 1, 1) = '3'::text))
(6 rows)

You'd need to add hash(id) = hash('searched_value') to your queries, or Postgres will search all the tables.


EDIT: You can also use the rule system to route insertions to the correct tables automatically:

create rule tablename_rule_0 as
  on insert to tablename where hash(NEW.id)='0'
  do instead insert into tablename_partition_0 values (NEW.*);
create rule tablename_rule_1 as
  on insert to tablename where hash(NEW.id)='1'
  do instead insert into tablename_partition_1 values (NEW.*);
-- and so on
insert into tablename (id) values ('a');
select * from tablename_partition_0;
 id | mdate 
----+-------
 a  | 
(1 row)
Tometzky
Does PostgreSQL support hash partitioning?
Pat
Somewhat. You'd need to add explicit hash(id)=hash(...) to your queries.
Tometzky