Hello. I have a data table (will be millions of records but I will make it simple here) that looks like this.

ID   APPROVAL_DT       DAY_DT        TRANS_COUNT     SALE_AMOUNT
1    2010-04-22        2010-04-27    2               260
1    2010-04-22        2010-04-28    1               40
2    2010-03-28        2010-04-02    1               5
2    2010-03-28        2010-04-03    5               10
2    2010-03-28        2010-04-04    1               20
3    2010-04-25        2010-05-01    6               10
3    2010-04-25        2010-05-02    4               10
4    2010-06-01        2010-06-07    1               5

I need to figure out the DAY_DT for each ID where either the sum of all previous and current DAY_DT TRANS_COUNTs >=10 OR sum of all previous and current DAY_DT SALE_AMOUNTs >= 25

So the results of the query applied to the above table would be

ID   APPROVAL_DT  ACTIVATED_DT
1    2010-04-22   2010-04-27
2    2010-03-28   2010-04-04
3    2010-04-25   2010-05-02
4    2010-06-01   NULL

Any thoughts?

+1  A: 

How many records per id will you have?

Itzik Ben-Gan did a comparison of various approaches to tackling running totals in SQL Server 2008.

His tests concluded that the triangular join was the best option up to about 15 records per partition. Beyond that point the SQL CLR approach became better. The triangular join continued to outperform a bog-standard T-SQL cursor until the partition size reached 500.

Obviously your mileage may vary but I think these are useful numbers to know.
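
To give a feel for why partition size matters so much to the triangular join, here is a rough Python sketch that only counts rows read. It assumes the triangular join re-sums rows 1..i for every row i of a partition, and that a single ordered pass reads each row once. The function names are made up for illustration, and row counts are not timings, so this does not reproduce the benchmark thresholds quoted above.

```python
def triangular_join_reads(partition_size):
    """Rows read to total one partition when row i re-sums rows 1..i."""
    return partition_size * (partition_size + 1) // 2


def single_pass_reads(partition_size):
    """Rows read when one ordered pass carries the running total forward."""
    return partition_size


# Partition sizes mentioned in this thread: 15, 41 (avg), 232 (max), 500.
for n in (15, 41, 232, 500):
    print(n, triangular_join_reads(n), single_pass_reads(n))
```

The quadratic growth is the point: a partition of 41 rows costs 861 row reads with the triangular join, and a partition of 232 rows costs 27028.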

Martin Smith
avg records per ID are 41, max is 232 and min is 1
thomas
Can't hurt to try it. I have about 600K records to test with.
thomas
Looks like I've been saved a job :-)
Martin Smith
+1  A: 

I assume you mean that you want to find, within each ID, the first day_dt for which the sum up to and including that day reaches trans_count >= 10 or sale_amount >= 25. You call this day the 'activated_dt'. Your description is quite different from this, because it does not specify that you want only the first day, and it asks for the sum of all previous days while your example result shows the sum up to the day.

I agree with Martin here that a running total would be the best performing one, as it could produce the result in a single scan of the table.
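
To make the single-scan idea concrete, here is a minimal Python sketch of the logic, not T-SQL and not a drop-in solution. It assumes the rows can be read ordered by (ID, DAY_DT) and uses the sample data from the question. The name `activation_dates` and its parameters are made up for illustration.

```python
from datetime import date

# Sample rows from the question: (id, approval_dt, day_dt, trans_count, sale_amount)
ROWS = [
    (1, date(2010, 4, 22), date(2010, 4, 27), 2, 260),
    (1, date(2010, 4, 22), date(2010, 4, 28), 1, 40),
    (2, date(2010, 3, 28), date(2010, 4, 2), 1, 5),
    (2, date(2010, 3, 28), date(2010, 4, 3), 5, 10),
    (2, date(2010, 3, 28), date(2010, 4, 4), 1, 20),
    (3, date(2010, 4, 25), date(2010, 5, 1), 6, 10),
    (3, date(2010, 4, 25), date(2010, 5, 2), 4, 10),
    (4, date(2010, 6, 1), date(2010, 6, 7), 1, 5),
]


def activation_dates(rows, min_count=10, min_amount=25):
    """Return {id: (approval_dt, activated_dt or None)} in one ordered pass."""
    result = {}
    current_id = None
    run_count = run_amount = 0
    # Sorting by (id, day_dt) stands in for an ordered scan of the table.
    for id_, approval_dt, day_dt, count, amount in sorted(
        rows, key=lambda r: (r[0], r[2])
    ):
        if id_ != current_id:
            # New ID: reset the running totals.
            current_id, run_count, run_amount = id_, 0, 0
            result[id_] = (approval_dt, None)
        run_count += count
        run_amount += amount
        crossed = run_count >= min_count or run_amount >= min_amount
        # Record only the first day on which a threshold is reached.
        if crossed and result[id_][1] is None:
            result[id_] = (approval_dt, day_dt)
    return result


for id_, (approval_dt, activated_dt) in activation_dates(ROWS).items():
    print(id_, approval_dt, activated_dt)
```

Each row is visited once, and an ID that never reaches a threshold (ID 4 in the sample) keeps `None` as its activation date.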

A solution without running totals would have to compute, for each day_dt, the totals up to that day and then pick the first qualifying day for each ID:

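-- [Table] stands in for your table name; TABLE is a reserved word, so it must be bracketed.
-- IDs that never reach a threshold (ID 4 in the sample) are not returned by this query.
with cte1 as (
select
  t.id,
  t.approval_dt,
  t.day_dt as activated_dt
from [Table] t
cross apply (
  -- totals for this id up to and including t.day_dt
  select sum(trans_count) as sum_tc,
     sum(sale_amount) as sum_sa,
     max(day_dt) as max_day_dt
  from [Table] c
  where c.id = t.id
  and c.day_dt <= t.day_dt) as p
where p.sum_tc >= 10
or p.sum_sa >= 25)
, cte2 as (
  -- number the qualifying days per id, earliest first
  select id
   , approval_dt
   , activated_dt
   , row_number() over (partition by id order by activated_dt) as rn
  from cte1)
select *
from cte2
where rn = 1;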
Remus Rusanu
this ran in 13 seconds on my 600K record test. It will run once per week so performance is not that big a deal and 13 seconds is certainly acceptable.
thomas
If you put a clustered index on the table by `(ID, DAY_DT)` I bet it will run much faster. Of course, you shouldn't cluster it for a criterion that is used only once a week. Nonetheless, my guess would be that other queries are similar (looking for order IDs in date ranges), so I would consider and evaluate clustering the table like this. What is your current clustered index key?
Remus Rusanu
Currently I don't have any indexes (indices?), as I am just starting this process. I will look into the clustered index you mention.
thomas
`sys.indexes` http://technet.microsoft.com/en-us/library/ms173760.aspx :) So it is 'indexes'. It may not be etymologically accurate, but it is technically correct.
Remus Rusanu
If your table is a heap at the moment, then a clustered key like the one I suggest may be a good idea, not only for this query. I recommend going over the `Designing Indexes` chapter at http://msdn.microsoft.com/en-us/library/ms190804.aspx
Remus Rusanu
+1  A: 

Alright. Brace yourself. I'm cheating a little by using the running total UPDATE method. There are some quirks, so be wary of putting this in production code. The biggest of these is how the table is traversed. A clustered index should probably be put on the APPROVAL_DT column to ensure that the dates aren't split. Anyway, here goes.

CREATE TABLE #test
(
    ID   int,
    APPROVAL_DT   date,    
    DAY_DT        date,
    TRANS_COUNT     int,
    SALE_AMOUNT int,
    DailyTransCount int,
    DailySalesTotal int
)

INSERT INTO #test
SELECT 1,'2010-04-22','2010-04-27',2,260,0,0 UNION ALL
SELECT 1,'2010-04-22','2010-04-28', 1,40, 0,0 UNION ALL
SELECT 2,'2010-03-28','2010-04-02', 1,5, 0,0 UNION ALL
SELECT 2,'2010-03-28','2010-04-03', 5,10, 0,0 UNION ALL
SELECT 2,'2010-03-28','2010-04-04', 1,20, 0,0 UNION ALL
SELECT 3,'2010-04-25','2010-05-01', 6,10, 0,0 UNION ALL
SELECT 3,'2010-04-25','2010-05-02', 4,10, 0,0 UNION ALL
SELECT 4,'2010-06-01','2010-06-07', 1,5, 0,0

DECLARE @PreviousDay date; SET @PreviousDay = '29991231'
DECLARE @DailyTransCount int; SET @DailyTransCount = 0
DECLARE @DailySalesTotal int; SET @DailySalesTotal = 0
DECLARE @Group int; SET @Group = 0

UPDATE #test
    SET DailyTransCount = 0,
        DailySalesTotal = 0

UPDATE #test
    SET @DailyTransCount = DailyTransCount = CASE WHEN APPROVAL_DT = @PreviousDay THEN @DailyTransCount + Trans_Count ELSE Trans_Count END,
        @DailySalesTotal = DailySalesTotal = CASE WHEN APPROVAL_DT = @PreviousDay THEN @DailySalesTotal + SALE_AMOUNT ELSE SALE_AMOUNT END,
        @PreviousDay = APPROVAL_DT

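-- Take APPROVAL_DT from Y so that IDs with no qualifying day still show it (with a NULL ACTIVATED_DT)
SELECT Y.ID, Y.APPROVAL_DT, X.DAY_DT AS ACTIVATED_DT FROM 
    (SELECT DISTINCT ID, APPROVAL_DT FROM #test) Y
    LEFT JOIN ( SELECT ID, APPROVAL_DT, MIN(DAY_DT) AS DAY_DT FROM #test
                WHERE DailyTransCount >= 10 OR DailySalesTotal >= 25
                GROUP BY ID, APPROVAL_DT ) X ON X.ID = Y.ID AND X.APPROVAL_DT = Y.APPROVAL_DT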

I should explain a few things: I added two more columns to the end of the table. You'll need to copy your data into a temp table (or a permanent table) that has these columns, so the running totals can be written into them. Once the totals are in the columns, it's just a SELECT to retrieve the results. There is more information on this technique here. Note that this solution is FAST, but somewhat unsafe.
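
To show why the traversal order matters, here is a small Python model of the carried variables. It is an illustration only: it assumes rows are processed one at a time in whatever order the engine visits them, and it resets the totals when APPROVAL_DT changes, as the UPDATE above does. The function name `carry_totals` is made up.

```python
def carry_totals(rows):
    """Carry totals row by row, resetting when approval_dt changes.

    rows: iterable of (approval_dt, trans_count, sale_amount) in visit order.
    Returns one (running_trans_count, running_sale_amount) per visited row.
    """
    previous = None
    run_count = run_amount = 0
    out = []
    for approval_dt, count, amount in rows:
        if approval_dt == previous:
            run_count += count
            run_amount += amount
        else:
            # Different approval date from the previous row: start over.
            run_count, run_amount = count, amount
        previous = approval_dt
        out.append((run_count, run_amount))
    return out


# Rows for ID 2 and ID 3 from the sample data.
grouped = [
    ("2010-03-28", 1, 5), ("2010-03-28", 5, 10), ("2010-03-28", 1, 20),
    ("2010-04-25", 6, 10), ("2010-04-25", 4, 10),
]
# Same rows, visited in an interleaved order.
interleaved = [grouped[0], grouped[3], grouped[1], grouped[4], grouped[2]]

print(carry_totals(grouped))      # [(1, 5), (6, 15), (7, 35), (6, 10), (10, 20)]
print(carry_totals(interleaved))  # [(1, 5), (6, 10), (5, 10), (4, 10), (1, 20)]
```

With the rows grouped, both IDs reach a threshold. With the same rows interleaved, the totals reset on every row and no threshold is ever reached, which is why the method depends on the visit order.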

Mike M.
It is indeed fast but it relies on completely undocumented behaviour. Also see "the rules" section here http://www.sqlservercentral.com/articles/T-SQL/68467/.
Martin Smith
Of course. I thought I made that clear with my warnings.
Mike M.
+1 because the trick is still good to know and have in your toolbox. Do we need to worry about the year 3000 bug now? ;)
Remus Rusanu
@Mike - But you haven't adhered to the rules section which is why I pointed it out! No clustered index, no ` OPTION (MAXDOP 1)`
Martin Smith
@Martin - Thank you for putting the article so that others can read the rules. @Remus - we'll all have job security if we live till the year 2999 :)
Mike M.