views: 44
answers: 2
We are currently using a summary table that aggregates information for our users on an hourly basis in UTC time. The problem we are having is that this table is becoming too large and is slowing our system down immensely. We have applied all of the recommended PostgreSQL tuning techniques and we are still experiencing slowness.

Our idea was to start aggregating by day rather than by hour, but the problem is that we allow our customers to change their timezone, which changes which hours fall into each day and means the daily data has to be recalculated.

Does anyone know of a way to store the daily summary but still keep the numbers and totals correct when customers switch timezones?

+1  A: 

Summarise the data in tables with a timeoffset column and a "day" field (a date) that is the day for that particular summary line. Index on (timeoffset, day, other relevant fields), clustered if possible (presumably PostgreSQL has clustered indexes?), and all should be well.
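
A minimal sketch of that layout, assuming a single "views" measure and a user_id key (daily_summary and the column names are placeholders, not names from the question):

    -- Hypothetical daily summary: one row per user, per local day, per UTC offset.
    CREATE TABLE daily_summary (
        timeoffset  smallint NOT NULL,   -- whole hours from UTC, e.g. -12 .. +14
        day         date     NOT NULL,   -- the local calendar day at that offset
        user_id     integer  NOT NULL,
        views       bigint   NOT NULL DEFAULT 0,
        PRIMARY KEY (timeoffset, day, user_id)
    );

    -- PostgreSQL has no clustered indexes as such, but CLUSTER physically
    -- reorders the table along an index as a one-off operation.
    CLUSTER daily_summary USING daily_summary_pkey;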

Will A
So, instead of 24 lines per day, one day would produce one line... times 24 time zones. I can't see a substantial gain here.
MPelletier
I thought about this, but then I would have to maintain 24 summary tables, which would also increase the possibility of discrepancies in reporting between timezones.
Russ Bradberry
@MPelletier - the difference is that you don't need to aggregate the 24 lines for a day to produce a daily figure - you pull out the one summary line for that particular timeoffset / day - so you're doing 1/24 of the work - with proper indexing of course.
Will A
@Russ - you don't have 24 summary tables - just the one, but with a timeoffset column that indicates the number of hours from GMT, and the day column indicates the day for that particular time offset. You'd have 24 rows for the 24 time offsets (or more or fewer if you need more or fewer time zones).
Will A
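
Against the hypothetical daily_summary table sketched above, pulling the daily figure becomes a single indexed lookup rather than a 24-row aggregation:

    -- One user's daily total in one timezone; the values are illustrative.
    SELECT views
    FROM daily_summary
    WHERE timeoffset = -5                 -- e.g. UTC-5
      AND day        = DATE '2010-08-01'
      AND user_id    = 42;
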
The hourly table has over 10M rows already, which is why I believe the performance is degrading - not necessarily because of the number of rows it has to aggregate, but because of the total amount it has to filter. The indexes are growing too large, I believe.
Russ Bradberry
Also, my clustered index is on two other important columns, and with Postgres you only get one per table.
Russ Bradberry
@Russ Bradberry: Can you trick Postgres with views? Say you index the day - is there a gain for a timezone-specific view? Or the other way around: a nested query for days (indexed) inside a query for the timezone?
MPelletier
@Will A, this may be viable with a columnar DB solution. I've been looking into a couple, and while there is a hefty initial overhead, it may prove to be a good solution.
Russ Bradberry
@Russ - I don't think you need to worry about your existing clustered index - if you add timeoffset and day to the start of that index you'll still reap the benefit from it *and* will only need to scan through the relevant rows - it's the "24 tables" approach in one easier-to-maintain table. :) Will be interested in hearing how this goes.
Will A
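
A sketch of that suggestion; col_a and col_b stand in for the two existing "important columns", which the thread doesn't name:

    -- Put timeoffset and day at the front of the index, keeping the existing columns.
    CREATE INDEX daily_summary_tz_idx
        ON daily_summary (timeoffset, day, col_a, col_b);

    -- CLUSTER is a one-off physical reorder; repeat it after large loads if needed.
    CLUSTER daily_summary USING daily_summary_tz_idx;
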
A: 

I'm assuming you've gone through all the partitioning considerations, such as partitioning by user.

I can see several solutions to your problem, depending on the usage pattern.

  1. Aggregate data per day, per user selection. In the event of a timezone change, programmatically recalculate the aggregate for that user (see the sketch after this list). This is plausible if timezone changes are infrequent and if a certain delay in the data is acceptable when a user changes timezones.

  2. If you have relatively few measures, you may maintain 24 columns for each measure - each describing the daily aggregate for the measure in a different timezone.

  3. If timezone changes are frequent and there are numerous measures, it seems like 24 different aggregate tables would be the way to go.
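
A rough sketch of option 1, assuming an hourly_summary(user_id, hour_utc, views) source table and the daily_summary layout from the first answer (table, column, and value names are assumptions):

    -- Recompute one user's daily rows from the hourly data after a timezone change.
    DELETE FROM daily_summary
    WHERE user_id = 42;

    INSERT INTO daily_summary (timeoffset, day, user_id, views)
    SELECT -5,                                     -- the user's new offset
           (hour_utc - interval '5 hours')::date,  -- shift UTC hours into local days
           user_id,
           sum(views)
    FROM hourly_summary
    WHERE user_id = 42
    GROUP BY (hour_utc - interval '5 hours')::date, user_id;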

shmichael
The timezone changes are, in fact, relatively few. I could programmatically recalculate the measures when the timezone changes, but the first change would have a significant delay. We have about 8 measures, so 24 columns per measure would not be a good idea. I'm beginning to think that 24 tables is the way to go. I looked into @Will A's solution and it may be viable with a columnar DB, but not with a DB that degrades with the number of rows.
Russ Bradberry
192 integer columns (8 measures × 24 timezones) isn't too bad, actually. And if you'll be using a columnar DB, I don't think you'll need any schema change whatsoever - at least not with the aforementioned problem in mind.
shmichael