I realize that, per Pg docs (http://www.postgresql.org/about/), one can store an unlimited number of rows in a table. However, what is the "rule of thumb" for usable number of rows, if any?

Background: I want to store daily readings for a couple of decades for 13 million cells. That works out to 13 M * (366|365) * 20 ~ 9.5e10, or 95 B rows (in reality, around 120 B rows).

So, using table partitioning, I set up a master table, and then inherited tables by year. That divvies up the rows to ~ 5.2 B rows per table.
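To give a concrete picture, the setup is roughly along these lines (table and column names here are simplified placeholders, not my actual schema):

    -- Parent table; children hold one year each (names are placeholders)
    CREATE TABLE readings (
        cell_id   int      NOT NULL,
        read_date date     NOT NULL,   -- a date is 4 bytes, same as an int
        v1        smallint,
        -- ... further smallint reading columns ...
        v9        smallint
    );

    CREATE TABLE readings_1990 (
        CHECK (read_date >= DATE '1990-01-01' AND read_date < DATE '1991-01-01')
    ) INHERITS (readings);
    -- ... one child table per year ...

(With constraint_exclusion enabled, queries that filter on the date only touch the relevant children.)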

Each row is 9 SMALLINTs and two INTs, so 26 bytes of data. Add to that the Pg overhead of 23 bytes per row, and we get 49 bytes per row. So each table, without any PK or any other index, will weigh in at ~ 0.25 TB.
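In practice the on-disk figure will probably land somewhat above that back-of-the-envelope 49 bytes, since tuples are padded to alignment boundaries and each row also needs a 4-byte item pointer on its page. Once a child table is loaded, the real number is easy to check (table name below is a placeholder):

    -- Measure actual on-disk bytes per row for one loaded child table
    ANALYZE readings_1990;
    SELECT pg_relation_size('readings_1990')::float8
           / NULLIF(reltuples, 0) AS bytes_per_row
    FROM pg_class
    WHERE relname = 'readings_1990';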

For starters, I have created only a subset of the above data, that is, only for about 250,000 cells. I still have to do a bunch of tuning (create proper indexes, etc.), but the performance is really terrible right now. Besides, every time I need to add more data, I will have to drop the keys and then recreate them. The saving grace is that once everything is loaded, it will be a read-only database.
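The load cycle per partition currently looks roughly like this (index name and file path are placeholders):

    -- Drop the index, bulk-load, then rebuild and re-analyze
    DROP INDEX IF EXISTS readings_1990_cell_date_idx;

    COPY readings_1990 FROM '/path/to/1990.csv' CSV;

    CREATE INDEX readings_1990_cell_date_idx
        ON readings_1990 (cell_id, read_date);

    ANALYZE readings_1990;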

Any suggestions? Any other strategy for partitioning?

+1  A: 

It's not just "a bunch of tuning (indexes etc.)". That part is crucial and a must-do.

You posted few details, but let's try.

The rule is: try to find the most common working set and see if it fits in RAM. Optimize hardware, PG/OS buffer settings and PG indexes/clustering for it. Otherwise look for aggregates, or, if that's not acceptable and you need fully random access, think about what hardware could scan the whole table for you in a reasonable time.
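One rough way to see whether the common working set is actually being served from memory is the per-table buffer hit ratio. This only sees PostgreSQL's shared buffers, not the OS cache, so treat it as a lower bound:

    -- Buffer-cache hit ratio per table (shared_buffers only; OS cache is not visible)
    SELECT relname,
           heap_blks_hit,
           heap_blks_read,
           round(100.0 * heap_blks_hit
                 / NULLIF(heap_blks_hit + heap_blks_read, 0), 1) AS hit_pct
    FROM pg_statio_user_tables
    ORDER BY heap_blks_read DESC;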

How large is your table (in gigabytes)? How does it compare to total RAM? What are your PG settings, including shared_buffers and effective_cache_size? Is this a dedicated server? If you have a 250 GB table and about 10 GB of RAM, you can only fit about 4% of the table in memory.
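You can answer the size question directly from the database (the table name below is just an example):

    -- Table + index size, and the two most relevant memory settings
    SELECT pg_size_pretty(pg_total_relation_size('readings_1990'));
    SHOW shared_buffers;
    SHOW effective_cache_size;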

Are there any columns which are commonly used for filtering, such as state or date? Can you identify the working set that is most commonly used (like only the last month)? If so, consider partitioning or clustering on these columns, and definitely index them. Basically, you're trying to make sure that as much of the working set as possible fits in RAM.
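For example, if the date is the common filter, something along these lines (names are illustrative) both indexes it and physically orders the table by it, so that a month's worth of data sits in far fewer pages:

    -- Index the common filter column, then rewrite the table in that order.
    -- CLUSTER takes an exclusive lock and rewrites the table, so do it
    -- during the load phase, before the database goes read-only.
    CREATE INDEX readings_1990_date_idx ON readings_1990 (read_date);
    CLUSTER readings_1990 USING readings_1990_date_idx;
    ANALYZE readings_1990;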

Avoid scanning the table at all costs if it does not fit in RAM. If you really do need absolutely random access, the only way it will be usable is with really sophisticated hardware: a persistent storage/RAM configuration that can read 250 GB in a reasonable time.
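To see where a representative query actually spends its time, EXPLAIN with buffer statistics shows how much comes from cache versus disk (requires PostgreSQL 9.0+; the query below is only an example):

    -- 'shared hit' blocks came from RAM, 'read' blocks came from disk
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM readings
    WHERE read_date BETWEEN DATE '2005-06-01' AND DATE '2005-06-30'
      AND cell_id = 12345;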

Konrad Garus