Update: There was a comment that the question was not clear, that I made a leap of logic claiming that I would have 118 billion rows. I have edited the text below to clarify things. See the italicized text below
I have been struggling with this for a while now, have even gone down a few paths, but I turn now to the community for ideas. First, the problem — Store six daily variables for ~ 25 years for cells in a grid.
- Number of vars = 6
- Number of cells ~ 13 million
- Number of days ~ 9125 (25 * 365)
Optimize the store for two different kinds of queries —
Query one: Retrieve the value of a single var for all or a portion of the cells for a single day. This is analogous to an image where every pixel is the value of a single var.
Query two: Retrieve values for all the days or a duration of days for a single var for a single cell. This is like grabbing a column out of a table in which each row holds all the vars for a single day.
So, I set about designing the db. A single table where every row is one day's values for one cell would like so
CREATE TABLE d (
yr SMALLINT,
yday SMALLINT,
a SMALLINT,
b SMALLINT,
d SMALLINT,
e SMALLINT,
f SMALLINT,
g SMALLINT,
cell_id INTEGER
)
WITH (
OIDS=FALSE
)
The data would look like so
yr yday a b c d e f g cell_id
----------------------------------------------------
1980 1 x x x x x x x 1
1980 2 x x x x x x x 1
1980 3 x x x x x x x 1
..
1980 365 x x x x x x x 1
...
1981 1 x x x x x x x 1
1981 2 x x x x x x x 1
1981 3 x x x x x x x 1
..
1981 365 x x x x x x x 1
The problem: The above table would have 13 m * 9125 rows ~ 118 billion rows. Huge indexes, slow queries, major issues loading the data, etc.
So, I partitioned the table into years like so
CREATE TABLE d_<yyyy> (
CHECK ( yr = <yyyy> )
) INHERITS (d)
Hmmm... still no satisfaction. I ended up with 1 + 25 tables, but each of the year tables now had ~ 4.75 billion rows, and the queries were still very slow.
So, I partitioned it all by years and days like so
CREATE TABLE d_<yyyy>_<yday> (
CHECK ( yr = <yyyy> AND yday = <yday> )
) INHERITS (d)
Each table now has 13 million rows, and is reasonably fast (although still not satisfactorily fast), but now I have 9K tables. That has its own problems. I can't query the master table anymore as Pg tries to lock all the tables and runs out of memory. Additionally, I can't anymore conduct query two above. I could do something like
SELECT a FROM d_1980_1 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_2 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_3 WHERE cell_id = 1
UNION
SELECT a FROM d_1980_4 WHERE cell_id = 1
UNION
...
But the above is hardly optimal.
Any suggestions, ideas, brainstorms would be appreciated. Perhaps Pg, or even a RDBMS, is not the right tool for this problem, in which case, suggestion for alternatives would be welcome as well.