I need to store large amounts of metering data in a database. A record consists of an id that identifies the data's source, a timestamp, and a value. Records are later retrieved by id and timestamp.

According to my previous experience (I am developing the successor of an application that has been in production use for the last five years), disk I/O is the relevant performance bottleneck for data retrieval. (See also this other question of mine.)

As I never look for single rows but always for (possibly large) groups of rows matching a range of ids and timestamps, a pretty obvious optimization seems to be to store larger, compressed chunks of data that are accessed via a much smaller index (e.g. a day number) and are decompressed and filtered on the fly by the application.

What I'm looking for is the best strategy for deciding which portion of the data to put into one chunk. In a perfect world, each user request would be fulfilled by retrieving one chunk of data and using most or all of it. So I want to minimize the number of chunks I have to load per request, and I want to minimize the excess data per chunk.
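To make the idea concrete, here is a minimal sketch of what the chunk handling could look like on the application side, assuming zlib-compressed, JSON-encoded chunks; the function names and the encoding are illustrative assumptions, not part of the actual application.

    import json
    import zlib

    def pack_chunk(records):
        """Compress one chunk's records [(timestamp, value), ...] into a single blob."""
        return zlib.compress(json.dumps(records).encode("utf-8"))

    def unpack_chunk(blob, ts_from, ts_to):
        """Decompress a chunk and filter it to the requested time range on the fly."""
        records = json.loads(zlib.decompress(blob).decode("utf-8"))
        return [(ts, value) for ts, value in records if ts_from <= ts < ts_to]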

I'll post an answer below containing my ideas so far, and make it community property so you can expand on it. Of course, if you have a different approach, post your own.

ETA: S. Lott has posted this answer below, which is helpful to the discussion even if I can't use it directly (see my comments). The point here is that the "dimensions" to my "facts" are (and should be) influenced by the end user and change over time. This is a core feature of the app and actually the reason I wound up with this question in the first place.

A: 

Option 1:

Make a good guess about what is going to be loaded together often and put that into one chunk of moderate size. Example: one chunk per day.

Pros:

  • easy; data lookup can be done with a simple calculation (which days are involved in the request's time range? see the sketch after this list) rather than by keeping an index of what went where.
  • archive structure is easy to understand without tools

Cons:

  • not the best performance
  • does not adapt to changing behaviour of the application's users
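
A minimal sketch of that "simple calculation" with Unix timestamps and day-numbered chunks; load_chunk is a hypothetical storage accessor returning a zlib-compressed, JSON-encoded blob (or None), in line with the sketch in the question.

    import json
    import zlib

    SECONDS_PER_DAY = 86400

    def days_in_range(ts_from, ts_to):
        """Day numbers touched by the requested time range (no index needed)."""
        return range(int(ts_from // SECONDS_PER_DAY), int(ts_to // SECONDS_PER_DAY) + 1)

    def fetch(source_id, ts_from, ts_to, load_chunk):
        """Load every day chunk overlapping the range, decompress, drop excess rows."""
        results = []
        for day in days_in_range(ts_from, ts_to):
            blob = load_chunk(source_id, day)  # hypothetical storage accessor
            if blob is None:
                continue
            for ts, value in json.loads(zlib.decompress(blob).decode("utf-8")):
                if ts_from <= ts < ts_to:
                    results.append((ts, value))
        return results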
Hanno Fietz
A: 

Option 2:

Develop a clever "rebalancing" strategy that keeps track of which data gets loaded together and tries to store data that is often loaded together in the same chunk (see the sketch after this list). This could include keeping copies of rows in multiple chunks.

Pros:

  • can be almost arbitrarily smart
  • could be made very effective performance-wise
  • allows for an evolving strategy

Cons:

  • can be almost arbitrarily hard to develop, test and debug
  • might suffer from poor performance until the self-optimization has kicked in
  • multiple copies of records could bloat storage
  • somehow I feel this should be done by the database
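
As a starting point, the bookkeeping for such a strategy might look something like this: count how often pairs of chunks are touched by the same request, then periodically consider merging (or duplicating rows across) the most frequent pairs. This is only a sketch; the class and the threshold are made up.

    from collections import Counter
    from itertools import combinations

    class CoAccessTracker:
        """Counts how often pairs of chunks are loaded by the same request."""

        def __init__(self):
            self.pair_counts = Counter()

        def record_request(self, chunk_keys):
            """Call once per user request with the set of chunk keys it touched."""
            for pair in combinations(sorted(set(chunk_keys)), 2):
                self.pair_counts[pair] += 1

        def merge_candidates(self, threshold):
            """Chunk pairs loaded together often enough to consider rebalancing."""
            return [pair for pair, count in self.pair_counts.items() if count >= threshold]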
Hanno Fietz
+1  A: 

Option 3:

Find a clever database feature that would do the job.

Hanno Fietz
+1  A: 

For options 1 & 3 you need a really good idea of what your most frequent queries are going to be. Use the 80/20 rule; don't try to make all queries perform at the same level.

Option 2 sounds intriguing, but the bookkeeping could get a little hairy.

Option 3 has some promise in that it could address the performance issue with little or no change to the application. Two things I'd suggest looking into:

  1. Table partitioning. Oracle and MS SQL Server (and others, I'm sure) support physically grouping a table's data by some value (in this case, the timestamp). You can configure the partitions to reside on different physical devices to spread load across hardware, hopefully reducing latency.
  2. Indexing with included columns. This always feels counter-intuitive to me, but by adding the columns that you want to retrieve to the index, entire queries can be executed without hitting the actual tables (see the sketch below).

The downside of both of these options shows up on mutation operations (insert/delete), where the indexes have to be updated.
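
Neither Oracle nor SQL Server is at hand here, but as a rough, runnable illustration of the covering-index idea (point 2), SQLite behaves similarly: it has no INCLUDE clause, yet putting the value column last in a composite index lets range queries be answered from the index alone. Table and column names are invented.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metering (source_id INTEGER, ts INTEGER, value REAL)")
    # The index also carries the value column, so queries on (source_id, ts)
    # that only select ts and value never have to touch the table itself.
    conn.execute("CREATE INDEX ix_metering ON metering (source_id, ts, value)")

    plan = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT ts, value FROM metering WHERE source_id = ? AND ts BETWEEN ? AND ?",
        (1, 0, 86400),
    ).fetchall()
    print(plan)  # the plan should mention 'USING COVERING INDEX ix_metering'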

You might want to try a combination of 1 and 3 (they're similar in a lot of ways) with a flavor of #2. Keep track of statistics about which time periods are most frequently queried (option 2) and revisit the strategies in 1 or 3 periodically until the performance for the core queries is acceptable.

Ken Gentle
+2  A: 

"groups of rows that match a range of ids and timestamps"

You have two dimensions: the source and time. I'm sure the data source has lots of attributes. Time, I know, has a lot of attributes (year, month, day, hour, day of week, week of year, quarter, fiscal period, etc., etc.)

While your facts have "just" an id and a timestamp, they could have FKs to the data source dimension and the time dimension.

Viewed as a star schema, a query that locates "groups of rows that match a range of ids" may, more properly, be a query for a group of rows with a common data source attribute. It isn't so much a random cluster of IDs as a cluster of IDs defined by some common attribute of your dimensions.

Once you define these attributes of the data source dimension, your "chunking" strategy should be considerably more obvious.

Further, you may find that the bit-mapped index capability of some database products makes it possible to simply store your facts in a plain-old table without sweating the chunk design at all.

If bit-mapped indexes still aren't fast enough, then perhaps you have to denormalize the data source attributes into both the dimension and the fact table, and then partition the fact table on this dimensional attribute.
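
For illustration only, a star-schema layout along these lines might look as follows (SQLite via Python; every table and column name here is invented): the facts carry only foreign keys, and "a range of ids" becomes a filter on a data source attribute.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_source (source_id INTEGER PRIMARY KEY, site TEXT, unit TEXT);
        CREATE TABLE dim_time   (time_id INTEGER PRIMARY KEY, ts INTEGER, day INTEGER, month INTEGER);
        CREATE TABLE fact_meter (source_id INTEGER REFERENCES dim_source,
                                 time_id   INTEGER REFERENCES dim_time,
                                 value     REAL);
    """)

    # "Rows matching a range of ids" becomes "facts whose source shares an attribute".
    rows = conn.execute("""
        SELECT t.ts, f.value
          FROM fact_meter f
          JOIN dim_source s ON s.source_id = f.source_id
          JOIN dim_time   t ON t.time_id   = f.time_id
         WHERE s.site = ? AND t.day BETWEEN ? AND ?
    """, ("plant-A", 14600, 14630)).fetchall()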

S.Lott
You're right about the star schema. However, "defining the attributes of the data source dimension" is not at all trivial and is even more dynamic than the storage strategy I'm asking about here. This is because the user can group and classify sources to her heart's delight.
Hanno Fietz
Coming up with just an id and a timestamp already meant removing some dimensions (such as the physical unit) from the original schema, which is currently in production (and has been for some years). My goal is to separate that logic from the storage and retrieval part to make it more flexible.
Hanno Fietz
@Hanno: "the user can group and classify sources to her heart's delight" Defeating all possible chunking strategies. I stand by bitmapped indexes.
S.Lott
@Hanno: I would not separate the user's grouping and clustering from storage. The user's query results are what you're supposed to be optimizing. I think there's no "pixie dust" that will solve this problem magically. It's hard.
S.Lott