views: 188
answers: 4

Hi,

I have a database whose tables hold billions of rows for a single month, and I have data going back five years. I have tried to optimize the data in every way I could think of, but the latency is not decreasing. I know there are solutions such as horizontal sharding and vertical sharding (a rough routing sketch is below), but I am not sure about open source implementations or the development time required to make the switch. Does anyone have experience with such systems?
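
To be concrete, here is a minimal sketch of the kind of horizontal sharding I mean: rows get routed to one of N physical tables by hashing a key. The shard count and all names here are purely illustrative, not an existing implementation.

    import hashlib

    NUM_SHARDS = 8  # illustrative shard count

    def shard_for(key: str) -> int:
        """Map a row key to a stable shard index."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    def shard_table(base_table: str, key: str) -> str:
        """Physical table holding rows for this key, e.g. events_3."""
        return f"{base_table}_{shard_for(key)}"

    # Example: decide where the row for customer '42' lives.
    print(shard_table("events", "42"))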

Thank you.

+5  A: 

Nobody can suggest anything without a use case. When you have data that's "Sagan-esque" in magnitude, the use case is all-important, since, as you've likely discovered, there simply isn't any "general" technique that works. The numbers are simply too large.

So, you need to be clear about what you want to do with this data. If the answer is "everything" then, you get slow performance, because you can't optimize "everything".

Edit:

Well, which is it? 2 or 3? How big are the result sets? Do you need access to all 5 years, or just the last month? Do you really need all that detail, or can it be summarized? Do you need to sort it? Are the keys enough? How often is the data updated? How quickly does the data need to be online once it is updated? What kind of service level does the data need to have? 24x7x7? 9-5x5? Is day-old data OK? Who's using the data? Interactive users? Batch reports? Exports to outside entities?

Will Hartung
I want to optimize for read performance based on two to three keys in the table.
Algorist
"Sagan-esque". I think I might have to use that one.
Nick Johnson
+1  A: 

PostgreSQL supports partitioning tables; a minimal example is sketched below. If nothing else, read its documentation. Answering Will Hartung's questions will help a lot in arriving at a solution.
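
For illustration, a minimal sketch of declarative range partitioning (available in PostgreSQL 10 and later), driven from Python with psycopg2; the table, column, and database names are made up:

    import psycopg2

    # Illustrative connection string; adjust for your environment.
    conn = psycopg2.connect("dbname=mydb")
    cur = conn.cursor()

    # Parent table partitioned by range on the event timestamp.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            event_id   bigint      NOT NULL,
            event_time timestamptz NOT NULL,
            payload    text
        ) PARTITION BY RANGE (event_time);
    """)

    # One child partition per month; queries that filter on event_time
    # only scan the relevant partitions (partition pruning).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events_2010_01
            PARTITION OF events
            FOR VALUES FROM ('2010-01-01') TO ('2010-02-01');
    """)

    conn.commit()
    cur.close()
    conn.close()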

BillThor
+3  A: 

Read up on Data Warehousing...

  1. Capture data in flat files. Do not load a database.

  2. Design a proper Star Schema architecture.

  3. Write programs to do dimensional conformance; those programs load only dimension changes into the database (a rough sketch follows after this list).

  4. Write programs to load selected flat-file records into a datamart with a copy of the dimensions.

Do not load a database with raw data. Ever.
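
A minimal, illustrative sketch of step 3, loading only new or changed dimension rows from a flat file. The file layout, table names, and the csv/sqlite choices are assumptions for the sketch, not part of the recipe above.

    import csv
    import sqlite3

    conn = sqlite3.connect("warehouse.db")  # stand-in for the real warehouse
    conn.execute("""
        CREATE TABLE IF NOT EXISTS dim_customer (
            customer_id TEXT PRIMARY KEY,
            name        TEXT,
            region      TEXT
        )
    """)

    # Current state of the dimension, keyed by natural key.
    existing = {
        row[0]: (row[1], row[2])
        for row in conn.execute("SELECT customer_id, name, region FROM dim_customer")
    }

    with open("raw_events.csv", newline="") as f:
        for rec in csv.DictReader(f):
            key = rec["customer_id"]
            attrs = (rec["customer_name"], rec["region"])
            if existing.get(key) != attrs:
                # Insert new members, overwrite changed ones (simple type-1 update);
                # unchanged rows are skipped entirely.
                conn.execute(
                    "INSERT OR REPLACE INTO dim_customer (customer_id, name, region) "
                    "VALUES (?, ?, ?)",
                    (key, *attrs),
                )
                existing[key] = attrs

    conn.commit()
    conn.close()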

S.Lott
A: 

How many GB of data is this? This reminds me of the story of LinkedIn: to calculate the social network fast enough, they had to run everything in memory. Stack Overflow itself runs on a server with lots of memory and keeps most of the database in memory at any one time, according to the SO podcast.

It also reminds me of Google's problem, which required custom software and tons of cheap machines working in tandem.

MatthewMartin