tags:

views:

1223

answers:

5

I'm trying to work out a pretty complex query in SQL Server 2008. I'd like some input from SQL experts here.

Imagine I had a Payments table with these fields:

PaymentID int, CustomerID int, PaymentDate datetime, Amount decimal

So essentially, it is a table of payments made by a customer on specific dates. An important thing to note is that in some cases, a payment amount can be a negative value. So, over time, the total amount paid by any given customer, can go up or down.

What we're trying to figure out is the SQL to calculate the high point of the total amount paid per customer.

So, if Fred made 3 payments: first for $5, second for $5, third for -$3. The report will show that Fred's peak total paid amount was $10 (on his second payment), and his final paid amount was $7.

We need to run this report for a hundred thousand customers (who've potentially made a hundred to a thousand payments each), so it's got to be fast.

Is there a good way to structure this query without storing the running totals in the db? We'd like to avoid storing precalculated values if at all possible.

+4  A: 

The operation is linear in the number of payments for each customer. So, you are going to have to go over each payment, keeping a running total and a high water mark and at the end of all the payments, you will have your answer. Whether you do that in a CLR stored procedure (immediately jumped to mind for me) or use a cursor or temp table or whatever, it's probably not going to be fast.

If you have to run this report over and over again, you should seriously consider keeping a high water mark field and update it (or not) whenever a payment comes in. That way, your report will be trivial -- but this is what data marts are for.

JP Alioto
JP, thanks for this. I've never tried a CLR stored procedure. My understanding is that in general, for things like this you want to avoid CLR because it just won't be as fast as TSQL optimized. I think we're going to end up storing the precalculated data somewhere.
Linus
+1  A: 
list = list of amounts ordered by date
foreach in list as amount
  running += amount
  if running >= high
    high = running

To keep it fast, you will require a running total incremented with amount on a trigger, and a high value for each customer (can also be updated by a trigger to make the re-query even simpler).

I don't think you can do this type of thing without code (stored procedures are code)

jim
Lou, I'm actually fine with the idea of using a stored procedure or a function that returns the data. What I sort of wanted to avoid was using cursors in the stored proc. Thanks.
Linus
+6  A: 

Your question seems to be this:

SELECT CustomerID, SUM(Ammount) FROM table WHERE Amount > 0 GROUP BY CustomerID
SELECT CustomerID, SUM(Ammount) FROM table GROUP BY CustomerID

However, I think you mean that you want a table that appears like this

Customer  Payment  HighPoint  RunningTotal
123       5        5          5
123       5        10         10
123       -3       10         7

In which case I would create a view with the two selects above so that the view is something like.

SELECT CusotmerID, 
  PaymentDate, 
  Ammount, 
  (SELECT SUM(Ammount) 
    FROM table as ALIAS 
    WHERE ALIAS.Amount > 0 
      AND ALIAS.PaymentDate <= PaymentDate 
      AND ALIAS.CustomerID = CustomerID), 
  (SELECT SUM(Ammount) 
    FROM table as ALIAS 
    WHERE ALIAS.CustomerID = CustomerID 
    AND ALIAS.PaymentDate <= PaymentDate)
FROM table

Also, you may consider a non-unique index on the Amount column of the table to speed up the view.

jellomonkey
This doesn't answer the question at all
jim
how does it not answer the question?
jellomonkey
wtf... did you edit? PaymentDate wasn't even in your select let alone any clear formatting pre comment.. SO bug?
jim
I did edit, I copied and pasted and then realized I grabbed the wrong thing.
jellomonkey
Haha that explains the confusion!
jim
+1  A: 

like Andomar's answer. You can do the running total for each payment. Then find the max peak payment...

with
rt as (
  select
    Payments.*,
    isnull(sum(p.Amount), 0) + Payments.Amount as running
  from
    Payments
    left outer join Payments p on Payments.CustomerID = p.CustomerID
      and p.PaymentDate <= Payments.PaymentDate
      and p.PaymentID < Payments.PaymentID
),
highest as
(
  select
    CustomerID, PaymentID, running as peak_paid
  from
    rt
  where
    PaymentID = (select top 1 rt2.PaymentID 
        from rt rt2 
        where rt2.CustomerID = rt.CustomerID
        order by rt2.running desc, rt2.PaymentDate, rt2.PaymentID)
)

select
  *,
  (select sum(amount) from Payments where Payments.CustomerID = highest.CustomerID) as total_paid  
from
  highest;

however, since you have around 1 million payments, this could be quite slow. Like others are saying, you would want to store the CustomerID, PaymentID and peak_paid in a separate table. This table could be updated on each Payment insert or as a sqljob.

Updated query to use join instead of subqueries. Since the PaymentDate does not have a time, I filter out multiple payments on the same day by the PaymentId.

dotjoe
dotjoe, that looks pretty good. The report is not run too often, so it maybe worth benchmarking this query. The one issue I see with this query is the calculation of the running total amount does not work if a customer makes more than one payment on a single day (which happens quite often), and we do not store the time component on payment date. Details.. we can figure something out from this.
Linus
if the PaymentID is auto-incremented maybe you could use that?
dotjoe
+4  A: 

As an alternative to subqueries, you can use a running total query. Here's how I set one up for this case. First create some test data:

create table #payments (
    paymentid int identity,
    customerid int,
    paymentdate datetime,
    amount decimal
)

insert into #payments (customerid,paymentdate,amount) values (1,'2009-01-01',1.00)
insert into #payments (customerid,paymentdate,amount) values (1,'2009-01-02',2.00)
insert into #payments (customerid,paymentdate,amount) values (1,'2009-01-03',-1.00)
insert into #payments (customerid,paymentdate,amount) values (1,'2009-01-04',2.00)
insert into #payments (customerid,paymentdate,amount) values (1,'2009-01-05',-3.00)
insert into #payments (customerid,paymentdate,amount) values (2,'2009-01-01',10.00)
insert into #payments (customerid,paymentdate,amount) values (2,'2009-01-02',-5.00)
insert into #payments (customerid,paymentdate,amount) values (2,'2009-01-03',7.00)

Now you can execute the running total query, which calculates the balance for each customer after each payment:

select cur.customerid, cur.paymentdate, sum(prev.amount)
from #payments cur
inner join #payments prev
    on cur.customerid = prev.customerid
    and cur.paymentdate >= prev.paymentdate
group by cur.customerid, cur.paymentdate

This generates data:

Customer  Paymentdate        Balance after payment
1         2009.01.01         1
1         2009.01.02         3
1         2009.01.03         2
1         2009.01.04         4
1         2009.01.05         1
2         2009.01.01         10
2         2009.01.02         5
2         2009.01.03         12

To look at the maximum, you can do a group by on the running total query:

select customerid, max(balance)
from (
    select cur.customerid, cur.paymentdate, balance = sum(prev.amount)
    from #payments cur
    inner join #payments prev
     on cur.customerid = prev.customerid
     and cur.paymentdate >= prev.paymentdate
    group by cur.customerid, cur.paymentdate
) runningtotal
group by customerid

Which gives:

Customer   Max balance
1          4
2          12

Hope this is useful.

Andomar
Thank you Andomar. I'll give this a shot.
Linus
+1 i like this much better than the sub queries
dotjoe
I like the #payments self-join. I still think MS should support "sum(x) over(order by xx)" though. (Can't that be used here where there is a column to partition on?)
araqnid