views: 65

answers: 1
I store time-series simulation results in PostgreSQL. The DB schema looks like this:

table SimulationInfo (
    simulation_id integer primary key,
    simulation_property1,
    simulation_property2,
    ...
)
table SimulationResult (  -- one row is roughly 100 bytes
    simulation_id integer,
    res_date date,
    res_value1,
    res_value2,
    ...
    res_value9,
    primary key (simulation_id, res_date)
)

I usually query data based on simulation_id and res_date.

I partitioned the SimulationResult table into 200 sub-tables based on ranges of simulation_id. A fully filled sub-table has 10-15 million rows. Currently about 70 sub-tables are fully filled, and the database is already more than 100 GB. All 200 sub-tables will be filled soon, and when that happens I will need to add more sub-tables.
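(For reference, a layout like the one described would typically be built with inheritance-based partitioning, the only mechanism available here; the child names and the range width below are invented, not taken from the actual setup.)

-- Parent table holds no rows; each child covers one simulation_id range.
CREATE TABLE SimulationResult_part_001 (
    CHECK (simulation_id >= 0 AND simulation_id < 1000)
) INHERITS (SimulationResult);

CREATE TABLE SimulationResult_part_002 (
    CHECK (simulation_id >= 1000 AND simulation_id < 2000)
) INHERITS (SimulationResult);
-- ... one child per range, currently 200 of them ...

-- Children do not inherit the parent's primary key, so each needs its own index.
CREATE UNIQUE INDEX simulationresult_part_001_pkey
    ON SimulationResult_part_001 (simulation_id, res_date);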

But I read this answer, which says that having more than a few dozen partitions does not make sense. So my questions are as follows.

  1. Why does having more than a few dozen partitions not make sense? I checked the execution plan on my 200 sub-tables, and it scans only the relevant sub-table, so I guessed that having more partitions, each one smaller, must be better.

  2. If the number of partitions should be limited to, say, 50, is it really no problem to have billions of rows in one table? How big can a single table get without serious problems, given a schema like mine?

+2  A: 

It's probably unwise to have that many partitions, yes. The main reason to have partitions at all is not to make indexed queries faster (for the most part, they are not), but to improve performance for queries that have to sequentially scan the table, by skipping partitions whose constraints can be proved not to hold; and to make maintenance operations cheaper (vacuum, or deleting large batches of old data, which in certain setups can be done by simply truncating a partition, and so on).
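To make that concrete, here is a minimal sketch, assuming the inheritance-based layout sketched in the question above (the child name is invented; constraint_exclusion and TRUNCATE are standard PostgreSQL):

-- Let the planner skip children whose CHECK constraint contradicts the WHERE
-- clause ('partition' is the default since 8.4 and covers inheritance children).
SET constraint_exclusion = partition;

-- A sequential-scan query constrained on simulation_id only touches the
-- children whose CHECK constraint can possibly be true:
EXPLAIN SELECT avg(res_value1)
FROM SimulationResult
WHERE simulation_id BETWEEN 100 AND 900;

-- Maintenance example: discarding one obsolete range is a cheap operation on
-- a single child instead of a huge DELETE (plus vacuum) on the whole table.
TRUNCATE TABLE SimulationResult_part_001;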

Maybe instead of using ranges of simulation_id (which means you need more and more partitions all the time), you could partition using a hash of it. That way all partitions grow at a similar rate, and there's a fixed number of partitions.
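There is no built-in hash partitioning at this point (see the last comment below), but the same effect can be emulated with CHECK constraints on a bucket expression. A rough sketch with an arbitrary bucket count of 50; the comments below refine the choice of bucket expression:

-- One child per bucket; new simulation_ids spread across all 50 buckets, so
-- every child grows at about the same rate and no new children are ever needed.
CREATE TABLE SimulationResult_b00 (
    CHECK ((simulation_id % 50) = 0)
) INHERITS (SimulationResult);

CREATE TABLE SimulationResult_b01 (
    CHECK ((simulation_id % 50) = 1)
) INHERITS (SimulationResult);
-- ... up to SimulationResult_b49 ...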

The problem with too many partitions is that, for example, the system is not prepared to deal with locking that many objects. Maybe 200 works fine, but it won't scale well when you reach a thousand and beyond (which doesn't sound that unlikely given your description).
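The lock-table pressure mentioned above can be checked against max_locks_per_transaction (a real setting; whether it needs raising depends on the workload):

-- Every child touched by a query or by maintenance takes at least an
-- AccessShareLock. The shared lock table holds roughly
-- max_locks_per_transaction * (max_connections + max_prepared_transactions)
-- entries, so with thousands of children it may have to be raised
-- (postgresql.conf change plus restart).
SHOW max_locks_per_transaction;   -- default is 64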

There's no problem with having billions of rows per partition.

All that said, there are obviously particular concerns that apply to each scenario. It all depends on the queries you're going to run, and what you plan to do with the data long-term (i.e. are you going to keep it all, archive it, delete the oldest, ...?)

alvherre
Thanks so much, alvherre. I need to keep all the simulation results so I can query various historical statistics. The reason I used ranges of simulation_id for partitioning is that I guessed it would be good to store the results of adjacent simulations together in one partition, because I usually query groups of adjacent simulations together. Anyway, you've resolved all my concerns. I'll go with hash partitioning and a limited number of partitions.
tk
If you are going to query adjacent simulations together, then it's probably a good idea to choose a mapping that puts several adjacent simulations in the same partition (so don't use plain modulo arithmetic for the hash). On the other hand, make sure you use constraints that can be proved true or false for each partition for whatever queries you're going to run, so that partitions that don't contain any simulation in your result set can be quickly discarded (as with range partitioning).
alvherre
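A sketch of one way to follow that advice, assuming 1000 adjacent simulation_ids per bucket and 50 buckets (both numbers invented). Because the planner generally cannot prove anything about an arithmetic bucket expression from a plain simulation_id predicate, the query has to repeat the expression (or target the child table directly) for the other children to be excluded:

-- Bucket mapping: 1000 consecutive simulation_ids share a bucket, then the
-- buckets wrap around, so the number of children stays fixed at 50.
CREATE TABLE SimulationResult_bucket07 (
    CHECK (((simulation_id / 1000) % 50) = 7)
) INHERITS (SimulationResult);
-- ... likewise for buckets 0..49 ...

-- The application computes the bucket and adds the same expression to the
-- WHERE clause so constraint exclusion can discard the other 49 children:
SELECT *
FROM SimulationResult
WHERE simulation_id BETWEEN 7010 AND 7020
  AND ((simulation_id / 1000) % 50) = 7
  AND res_date >= DATE '2010-01-01';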
alvherre, it seems that hash partitioning is not supported yet: http://wiki.postgresql.org/wiki/Table_partitioning Do you know a workaround for hash partitioning? Thanks.
tk