views:

115

answers:

3

I have a database with a table, storing changes in account-balance across a couple of accounts, with three columns;

float   balance, #The account balance after the change
Date    date,    #Date that balance change occurred
int     aid      #Account that the balance change occurred on

It contains a couple of entries for each day of the year, and I want to retrieve the balance of every five days. I also want it to separate between accounts (ie if two changes occurred on the same day, but on separate accounts, return both).

The problem is this: Sometimes there will be several days (or weeks) where there is no data available. When that occurs, I want to make sure to return the latest entry before the "hole" in the dataset. This is a simplified version of the problem, the actual database is big (several gigabytes), the size is the reason why I want to return a subset of the data. It cannot use platform specific methods, because it needs to work on both oracle and mySQL.

My question is: Is there any way to do this fast? I would be able to write a query that gets the job done, but I am hoping there is some devil magic way of doing it that does not require lots of nested queries and aggregate functions..

A: 

If you wait for postgresql 8.4 you might be able to make use of Window Functions

http://www.postgresql.org/docs/8.4/static/tutorial-window.html

http://www.postgresql.org/docs/8.4/static/functions-window.html

mikelikespie
+1  A: 

You can do this in a relatively straightforward way by creating a period table, which you can join with the accounts table to create one row per account per period.

Here's an example. Let's set up some temporary tables:

create table #balance (
    id int identity,
    balance float,
    date datetime,
    aid int
)

create table #period (
    id int identity,
    startdt datetime,
    enddt datetime
)

Enter some test data:

insert into #yourtable (balance, date, aid) values (4,'2009-01-01',1)
insert into #yourtable (balance, date, aid) values (5,'2009-01-10',1)
insert into #yourtable (balance, date, aid) values (6,'2009-01-10',1)
insert into #yourtable (balance, date, aid) values (7,'2009-01-16',1)
insert into #yourtable (balance, date, aid) values (2,'2009-01-01',2)
insert into #yourtable (balance, date, aid) values (3,'2009-01-10',2)
insert into #yourtable (balance, date, aid) values (4,'2009-01-10',2)
insert into #yourtable (balance, date, aid) values (5,'2009-01-16',2)

insert into #period (startdt, enddt) values ('2009-01-01','2009-01-06')
insert into #period (startdt, enddt) values ('2009-01-06','2009-01-11')
insert into #period (startdt, enddt) values ('2009-01-11','2009-01-16')
insert into #period (startdt, enddt) values ('2009-01-16','2009-01-21')

Now let's query all periods:

from #period p

Add one row for each balance before the end of the period:

left join #balance b1 on 
    b1.date <= p.enddt

Search for balances in between the balance from the first join, and the end of the period:

left join #balance b2 on 
    b2.aid = b1.aid
    and b1.id < b2.id
    and b2.date <= p.enddt

Then filter out the rows that are not the last balance for their period.

where
    b2.aid is null

The b2 join basically looks for the "in-between" value, and by saying it's id is null, you say no in-between row exists. The final query looks like this:

select 
    b1.aid
,   p.startdt
,   b1.balance
from #period p
left join #balance b1 on 
    b1.date <= p.enddt
left join #balance b2 on 
    b2.aid = b1.aid
    and b1.id < b2.id
    and b2.date <= p.enddt
where
    b2.aid is null
order by b1.aid, p.startdt

Note: the queries assume a balance with a later date always has a larger id. If you never have to balances with exactly the same end date, you can replace "b1.id < b2.id" with "b1.date < b2.date".

Andomar
+2  A: 

I would use Andomar's Period table idea, but I would try a slightly different final query. This assumes that your Account_Balances table has a PK on aid and date. If you ended up with two balances for the same account for the same exact date and time then you would get some duplicate rows.

SELECT
     P.start_date,
     P.end_date,
     AB1.account_id,
     AB1.balance
FROM
     Periods P
LEFT OUTER JOIN Account_Balances AB1 ON
     AB1.date <= P.end_date
LEFT OUTER JOIN Account_Balances AB2 ON
     AB2.aid = AB1.aid AND
     AB2.date > AB1.date AND
     AB2.date <= P.end_date
WHERE
     AB2.aid IS NULL

If the account has no rows before or during the given period you will not get a row back for it.

Tom H.
ool, looks better than mine: you're not doing the "distinct aid", and the noone-in-between join is probably faster. Is it ok if I change my query based on yours?
Andomar
@Andomar: Yep, feel free. Although sometimes one method might be faster than another depending on the data. Most of the times I find the LEFT OUTER JOIN to be faster though.
Tom H.
Thanks, edited. Excpect for the date check, which didn't work for the test data because it has multiple balances with the same date.
Andomar
Just to clarify, if there were a time portion to the dates, it would still work by using the latest balance for the given day. If two dates are EXACTLY the same though then SQL doesn't know which one is really wanted without clearer business rules.
Tom H.