Update: There was a comment that the question was not clear and that I made a leap of logic in claiming I would have 118 billion rows, so I have edited the text below to clarify things; see the italicized text below.

I have been struggling with this for a while now, have even gone down a few paths, but I now turn to the community for ideas. First, the problem: store six daily variables for ~25 years for the cells of a grid.

  • Number of vars = 6
  • Number of cells ~ 13 million
  • Number of days ~ 9125 (25 * 365)

Optimize the store for two different kinds of queries —

Query one: Retrieve the value of a single var for all or a portion of the cells for a single day. This is analogous to an image where every pixel is the value of a single var.

Query two: Retrieve values for all the days or a duration of days for a single var for a single cell. This is like grabbing a column out of a table in which each row holds all the vars for a single day.

So, I set about designing the db. A single table, where every row holds one day's values for one cell, would look like so

CREATE TABLE d (
    yr      SMALLINT,
    yday    SMALLINT,
    a       SMALLINT,
    b       SMALLINT,
    c       SMALLINT,
    d       SMALLINT,
    e       SMALLINT,
    f       SMALLINT,
    cell_id INTEGER
)
WITH (
    OIDS=FALSE
)

The data would look like so

yr      yday    a   b   c   d   e   f   cell_id
------------------------------------------------
1980    1       x   x   x   x   x   x   1
1980    2       x   x   x   x   x   x   1
1980    3       x   x   x   x   x   x   1
..
1980    365     x   x   x   x   x   x   1
...
1981    1       x   x   x   x   x   x   1
1981    2       x   x   x   x   x   x   1
1981    3       x   x   x   x   x   x   1
..
1981    365     x   x   x   x   x   x   1

The problem: The above table would have 13 m * 9125 rows ~ 118 billion rows. Huge indexes, slow queries, major issues loading the data, etc.
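
For reference, the two target queries against that single table would look roughly like this (the year, day, and cell id values are just placeholders):

-- query one: variable a for every cell on a single day
SELECT cell_id, a FROM d WHERE yr = 1980 AND yday = 1

-- query two: variable a for one cell across a span of days
SELECT yr, yday, a FROM d WHERE cell_id = 1 AND yr BETWEEN 1980 AND 1984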

So, I partitioned the table into years like so

CREATE TABLE d_<yyyy> (
    CHECK ( yr = <yyyy> )
) INHERITS (d)
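
For example, the 1980 child table, along with the indexes the two query patterns would want (the index names are just illustrative):

CREATE TABLE d_1980 (
    CHECK ( yr = 1980 )
) INHERITS (d);

-- query one filters on the day, query two on the cell
CREATE INDEX d_1980_yday_idx ON d_1980 (yday);
CREATE INDEX d_1980_cell_idx ON d_1980 (cell_id);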

Hmmm... still no satisfaction. I ended up with 1 + 25 tables, but each of the year tables now had ~ 4.75 billion rows, and the queries were still very slow.

So, I partitioned it all by years and days like so

CREATE TABLE d_<yyyy>_<yday> (
    CHECK ( yr = <yyyy> AND yday = <yday> )
) INHERITS (d)

Each table now has 13 million rows, and queries against a single partition are reasonably fast (although still not satisfactorily fast), but now I have ~9K tables. That has its own problems: I can't query the master table anymore, as Pg tries to lock all the child tables and runs out of memory, and I can no longer run query two above in one statement. I could do something like

SELECT a FROM d_1980_1 WHERE cell_id = 1 
UNION 
SELECT a FROM d_1980_2 WHERE cell_id = 1 
UNION 
SELECT a FROM d_1980_3 WHERE cell_id = 1 
UNION 
SELECT a FROM d_1980_4 WHERE cell_id = 1 
UNION 
...

But the above is hardly optimal.
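
Even generating the statement mechanically with UNION ALL (a rough sketch, assuming PostgreSQL 9.1+ for string_agg and format, and the d_<yyyy>_<yday> naming above) only papers over the problem:

-- build the 365-branch UNION ALL query as text, to be executed afterwards
SELECT string_agg(
           format('SELECT a FROM d_1980_%s WHERE cell_id = 1', yday),
           ' UNION ALL '
       )
FROM generate_series(1, 365) AS yday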

Any suggestions, ideas, or brainstorms would be appreciated. Perhaps Pg, or even an RDBMS, is not the right tool for this problem, in which case suggestions for alternatives would be welcome as well.

+1  A: 

Without resorting to massive indexing or duplication of data, I think it will be difficult to find a single schema design that is optimal for both of your queries.

If you cluster your data by either date or cell, retrieval by that condition can be made to run fast, but not by both at the same time.

Assuming that access by date is the most important, you could lay out your table as below:

CREATE TABLE d (
    day DATE,
    a   SMALLINT[],   -- one array element per cell
    b   SMALLINT[],
    c   SMALLINT[],
    d   SMALLINT[],
    e   SMALLINT[],
    f   SMALLINT[]
);

Observe that there is now only one row per day, and that the per-cell fields have become arrays, so each cell has its own array index. If the cell numbering is not a contiguous, one-based sequence (PostgreSQL array subscripts start at 1 by default), a small table can hold the mapping from cell ids to array indexes.
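
A minimal sketch of such a mapping table (the names cell_map and arr_idx are just placeholders):

-- maps an external cell id to its position in the per-day arrays
CREATE TABLE cell_map (
    cell_id INTEGER PRIMARY KEY,
    arr_idx INTEGER NOT NULL UNIQUE
);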

Query 1,

Retrieve the value of a single var for all or a portion of the cells for a single day.

is accomplished by, for example,

SELECT a FROM d WHERE day = '1981-01-01'
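
or, for a contiguous portion of the cells, with an array slice (the bounds 1 and 1000 are placeholders):

SELECT a[1:1000] FROM d WHERE day = '1981-01-01'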

Query 2,

Retrieve values for all the days or a duration of days for a single var for a single cell.

will be of the form

SELECT a[1000] FROM d WHERE day BETWEEN '1981-01-01' AND '1981-12-31'

I believe large arrays in PostgreSQL can be read without loading the whole value; I know BLOBs can be. If that is the case, this solution may be fast enough for you. Otherwise I would suggest keeping a second copy of the data, organized to optimize access by cell.
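
A sketch of what that second, cell-oriented copy could look like, again using one array element per day and purely as an illustration:

-- one row per cell; each array holds one element per day of the 25-year span
CREATE TABLE d_by_cell (
    cell_id INTEGER PRIMARY KEY,
    a       SMALLINT[],
    b       SMALLINT[],
    c       SMALLINT[],
    d       SMALLINT[],
    e       SMALLINT[],
    f       SMALLINT[]
);

-- query two then becomes a single-row lookup plus an array slice over the day range
SELECT a[1:365] FROM d_by_cell WHERE cell_id = 1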

Anders Johannsen