views:

24

answers:

1

I'm trying to do a cumulative aggregate in MySQL.

My raw data looks like this: (my dates are month/day/year)

user_id    created_at    source
1          1/1/01        foo
2          1/1/01        foo
3          1/2/01        bar
4          1/2/01        foo
5          1/3/01        foo
     ....

Given this, I'd like to make a result that looks like this:

date    source    total_in_source
1/1/01  foo       2
1/2/01  foo       3
1/3/01  foo       4

where total_in_source is an aggregate of the number of users created_at between the first and the current value of date.

I can do this using a correlated subquery, but that's a bit inefficient:

select 
date(user.created_at) d,
user.source as user_source,
(select count(*) from users u where u.source = user_source and month(u.created_at) <= month_joined) as total_users_source,
from users user group by d, user_source

I've done some looking around and it seems like doing this might be a step in the right direction:

select date(u1.created_at) as 'd',
u1.source as 'source', 
count(distinct u2.id) as count_users
from users u1, users u2
where
u2.created_at BETWEEN DATE(u1.created_at) AND DATE_ADD(DATE(u1.created_at), interval 1 day) and
group by d, source

but even this is a self-join and produces n^2 rows to iterate over (users * users). Is there a best-practice on how to do this?

thanks in advance.

+1  A: 

You'll want to "artificially create" the start and end dates to join on either in a temporary table or directly inline.

Let's say you wanted every combo of start, end month and then inside that time range break down by source and count as you described. You could do this:

select u.source, sdates.val start, edates.val end, count(*) from (select distinct(date(created_at)) val from users where day(created_at) = 1) sdates join (select distinct(date(created_at)) val from users where day(created_at) = 1) edates join users u on u.created_at between sdates.val and edates.val group by sdates.val, edates.val, u.source;

Basically the inner queries "get you the dates" and the outer one actually does the computation. It can be helpful to have a "dates" table around that has one row for each relevant date to your project so you can easily join on it and not have to do this voodoo, but in the absence of that this seems to work.

Also, you might not want every combination of start and end, so you can play with the 'where' conditions on the inner queries or the 'on' of the join to suit your needs.

Curious if anyone has a better solution.

zellunit