If you're doing min/max/avg queries, do you prefer to use aggregation tables or simply query across a range of rows in the raw table?

This is obviously a very open-ended question and there's no one right answer, so I'm just looking for people's general suggestions. Assume that the raw data table consists of a timestamp, a numeric foreign key (say a user id), and a decimal value (say a purchase amount). Furthermore, assume that there are millions of rows in the table.
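For concreteness, here's a minimal sketch of the kind of raw table and range query I have in mind (the table and column names are just illustrative):

```sql
-- Hypothetical raw data table, matching the description above
CREATE TABLE purchases (
    id           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id      INT UNSIGNED    NOT NULL,  -- numeric foreign key
    purchased_at DATETIME        NOT NULL,  -- timestamp
    amount       DECIMAL(10,2)   NOT NULL,  -- decimal value
    PRIMARY KEY (id)
);

-- Typical min/max/avg query over a date range for one user
SELECT MIN(amount), MAX(amount), AVG(amount)
FROM purchases
WHERE user_id = 42
  AND purchased_at >= '2010-01-01'
  AND purchased_at <  '2010-02-01';
```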

I have done both and am torn. On one hand, aggregation tables have given me significantly faster queries, but at the cost of a proliferation of additional tables. Displaying the current values for an aggregated range either requires dropping all the way back to the raw data table or combining more fine-grained aggregations. I have found that keeping track in the application code of which aggregation table to query when is more work than you'd think, and that schema changes will be required, as the original aggregation ranges will invariably not be enough ("But I wanted to see our sales over the last 3 pay periods!").
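To illustrate the aggregation-table approach, a daily rollup of the hypothetical `purchases` table above might look roughly like this (names are again made up):

```sql
-- Hypothetical daily rollup of the raw purchases table
CREATE TABLE purchases_daily (
    user_id       INT UNSIGNED  NOT NULL,
    purchase_date DATE          NOT NULL,
    min_amount    DECIMAL(10,2) NOT NULL,
    max_amount    DECIMAL(10,2) NOT NULL,
    sum_amount    DECIMAL(12,2) NOT NULL,
    row_count     INT UNSIGNED  NOT NULL,
    PRIMARY KEY (user_id, purchase_date)
);

-- Nightly refresh of one day's rollup
REPLACE INTO purchases_daily
SELECT user_id, DATE(purchased_at),
       MIN(amount), MAX(amount), SUM(amount), COUNT(*)
FROM purchases
WHERE purchased_at >= '2010-01-15'
  AND purchased_at <  '2010-01-16'
GROUP BY user_id, DATE(purchased_at);

-- Ranges that align with whole days can be answered from the rollup;
-- note that AVG has to be recomputed as SUM(sum_amount) / SUM(row_count).
SELECT MIN(min_amount), MAX(max_amount),
       SUM(sum_amount) / SUM(row_count) AS avg_amount
FROM purchases_daily
WHERE user_id = 42
  AND purchase_date BETWEEN '2010-01-01' AND '2010-01-31';
```

Any range that doesn't align with whole days (the "last 3 pay periods" problem) forces me back to the raw table or to a finer-grained rollup.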

On the other hand, querying the raw data can be punishingly slow but lets me be very flexible about the date ranges. When the range bounds change, I simply change a query rather than having to rebuild aggregation tables, and the application code requires fewer updates. I suspect that if I were smarter about my indexing (i.e. always having good covering indexes), I could reduce the penalty of selecting from the raw data, but that's by no means a panacea.
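For what it's worth, the covering index I have in mind for the raw-data approach would be something like this (on the hypothetical table above):

```sql
-- Covering index: (user_id, purchased_at) narrows the range scan, and
-- including amount lets the engine answer MIN/MAX/AVG from the index alone,
-- without touching the base rows.
ALTER TABLE purchases
  ADD INDEX idx_user_time_amount (user_id, purchased_at, amount);
```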

Is there any way I can have the best of both worlds?

+3  A: 

We had the same problem and ran into the same issues you describe. We ended up switching our reporting to Analysis Services. There is a learning curve with MDX and Analysis Services itself, but it's been great. Some of the benefits we have found are:

  1. You have a lot of flexibility to query any way you want. Before, we had to build specific aggregates; now one cube answers all our questions.
  2. Storage in a cube is far smaller than the detailed data.
  3. Building and processing the cubes takes less time and produces less load on the database servers than the aggregates did.

Some CONS:

  1. There is a learning curve around building cubes and learning MDX.
  2. We had to create some tools to automate working with the cubes.

UPDATE: Since you're using MySQL, you could take a look at Pentaho Mondrian, an open-source OLAP solution that supports MySQL. I've never used it myself, so I don't know whether it will work for you, but I'd be interested to hear if it does.

TskTsk
+1 for mentioning Pentaho. Some of the folks involved in Pentaho come from Cognos of BI fame.
celopes
A: 

I always lean towards raw data. Once aggregated, you can't go back.
This has nothing to do with deletion: unless it's the simplest of aggregated data sets, you can't accurately revert/transpose the aggregated values back to the raw data.

Ideally, I'd use a materialized view (assuming the data can fit within the constraints), because it is effectively a table. But MySQL doesn't support them, so the next consideration would be a view with computed columns, or a trigger to update an actual table.
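As a rough sketch of the trigger idea, using the hypothetical `purchases` / `purchases_daily` tables from the question (this only covers inserts; updates and deletes to raw rows would need their own handling or periodic rebuilds):

```sql
-- Keep a daily summary table in sync on every insert into the raw table,
-- emulating a materialized view in MySQL.
DELIMITER //
CREATE TRIGGER trg_purchases_rollup
AFTER INSERT ON purchases
FOR EACH ROW
BEGIN
    INSERT INTO purchases_daily
        (user_id, purchase_date, min_amount, max_amount, sum_amount, row_count)
    VALUES
        (NEW.user_id, DATE(NEW.purchased_at),
         NEW.amount, NEW.amount, NEW.amount, 1)
    ON DUPLICATE KEY UPDATE
        min_amount = LEAST(min_amount, NEW.amount),
        max_amount = GREATEST(max_amount, NEW.amount),
        sum_amount = sum_amount + NEW.amount,
        row_count  = row_count + 1;
END//
DELIMITER ;
```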

OMG Ponies
Did I miss the part where he suggested aggregating and deleting the original data? Of course the raw data needs to be kept. But in addition to the raw data, some aggregate data is ok to store too.
marcc
@marcc: Where did I say that the original data would be deleted?
OMG Ponies
@Ponies: Maybe when you said that once aggregated you can't go back :)
Daniel Vassallo
@Daniel: That has *nothing* to do with deletion. When I get a value from `SUM(col1 * col2)`, the larger the data set, the less likely it is that I can revert that value back to what it is composed of.
OMG Ponies
A: 

It helps to pick a good primary key (e.g. [user_id, used_date, used_time]). For a constant user_id it's then very fast to do a range condition on used_date.

But as the table grows, you can reduce the table size by aggregating into a table keyed like [user_id, used_date]. For every range where the time of day doesn't matter, you can then use that table. Another way to reduce the table size is to archive old data that is no longer queried (or no longer allowed to be queried).
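For example (table and column types are illustrative, and this assumes at most one row per user per timestamp):

```sql
-- Raw table keyed so a (user_id, date-range) lookup is a single index range scan
CREATE TABLE usage_raw (
    user_id   INT UNSIGNED  NOT NULL,
    used_date DATE          NOT NULL,
    used_time TIME          NOT NULL,
    amount    DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (user_id, used_date, used_time)
);

-- For a constant user_id, the range condition on used_date stays within the index
SELECT MIN(amount), MAX(amount), AVG(amount)
FROM usage_raw
WHERE user_id = 42
  AND used_date BETWEEN '2010-01-01' AND '2010-01-31';
```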

Menno