views:

49

answers:

2

Geetings, Stackers.

I have a huge number of data-points in a SQL table, and I want to summarise them in a way reminiscent of RRD.

Assuming a table such as

 ID | ENTITY_ID | SCORE_DATE | SCORE | SOME_OTHER_DATA
----+-----------+------------+-------+-----------------
  1 | A00000001 | 01/01/2010 |   100 | some data
  2 | A00000002 | 01/01/2010 |   105 | more data
  3 | A00000003 | 01/01/2010 |   104 | various text
... | ......... | .......... | ..... | ...
... | A00009999 | 01/01/2010 |   101 | 
... | A00000001 | 02/01/2010 |   104 | 
... | A00000002 | 02/01/2010 |   119 | 
... | A00000003 | 02/01/2010 |   119 | 
... | ......... | .......... | ..... | 
... | A00009999 | 02/01/2010 |   101 | arbitrary data
... | ......... | .......... | ..... | ...
... | A00000001 | 01/02/2010 |   104 | 
... | A00000002 | 01/02/2010 |   119 | 
... | A00000003 | 01/01/2010 |   119 | 

I want to end up with one record per entity, per month:

 ID | ENTITY_ID | SCORE_DATE | SCORE |
----+-----------+------------+-------+
... | A00000001 | 01/01/2010 |   100 |
... | A00000002 | 01/01/2010 |   105 |
... | A00000003 | 01/01/2010 |   104 |
... | A00000001 | 01/02/2010 |   100 |
... | A00000002 | 01/02/2010 |   105 |
... | A00000003 | 01/02/2010 |   104 |

(I Don't care about the SOME_OTHER_DATA - I'll pick something - either the first or last record probably.)

What's an easy way of doing this on a regular basis, so that anything in the last calendar month is summarised in this way?

At the moment my plan is kind of:

  • For each EntityID
    • For each month
      • Find average score for all records in given month
      • Update first record with results of previous step
      • Delete all records that aren't the first

I can't think of a neat way of doing it though, that doesn't involve lots of updates and iteration.

This can either be done in a SQL Stored Procedure, or it can be incorporated into the .Net app that's generating this data, so the solution doesn't really need to be "one big SQL script", but can be :)

(SQL-2005)

A: 

Give this a try:

--I am using @table variables here, you will want to use your actual table in place of @YourTable and a #Temptable for @YourTable2, with a PK on ID
SET NOCOUNT ON
DECLARE @YourTable table (ID int,ENTITY_ID char(9),SCORE_DATE datetime,SCORE int ,SOME_OTHER_DATA varchar(100))
DECLARE @YourTable2 table (ID int)
INSERT INTO @YourTable VALUES (1 , 'A00000001','01/01/2010',100,'some data')
INSERT INTO @YourTable VALUES (2 , 'A00000002','01/01/2010',105,'more data')
INSERT INTO @YourTable VALUES (3 , 'A00000003','01/01/2010',104,'various text')
INSERT INTO @YourTable VALUES (4 , 'A00009999','01/01/2010',101,null)
INSERT INTO @YourTable VALUES (5 , 'A00000001','02/01/2010',104,null)
INSERT INTO @YourTable VALUES (6 , 'A00000002','02/01/2010',119,null)
INSERT INTO @YourTable VALUES (7 , 'A00000003','02/01/2010',119,null)
INSERT INTO @YourTable VALUES (8 , 'A00009999','02/01/2010',101,'arbitrary data')
INSERT INTO @YourTable VALUES (9 , 'A00000001','01/02/2010',104,null)
INSERT INTO @YourTable VALUES (10, 'A00000002','01/02/2010',119,null)
INSERT INTO @YourTable VALUES (11, 'A00000003','01/01/2010',119,null)
SET NOCOUNT OFF

SELECT 'BEFORE',* FROM @YourTable ORDER BY ENTITY_ID,SCORE_DATE

UPDATE y
    SET SCORE=dt_a.AvgScore
    OUTPUT INSERTED.ID   --capture all updated rows
        INTO @YourTable2
    FROM @YourTable y
        INNER JOIN (SELECT --get avg score for each ENTITY_ID per month
                        ENTITY_ID
                            ,AVG(SCORE) as AvgScore
                            , DATEADD(month,DATEDIFF(month,0,SCORE_DATE),0) AS MonthOf,DATEADD(month,1,DATEADD(month,DATEDIFF(month,0,SCORE_DATE),0)) AS MonthNext
                        FROM @YourTable
                        --group by 1st day  of current month and 1st day of next month
                        --so an index can be used when joining derived table to UPDATE table
                        GROUP BY ENTITY_ID, DATEADD(month,DATEDIFF(month,0,SCORE_DATE),0),DATEADD(month,1,DATEADD(month,DATEDIFF(month,0,SCORE_DATE),0))
                   ) dt_a ON y.ENTITY_ID=dt_a.ENTITY_ID AND y.SCORE_DATE>=dt_a.MonthOf AND y.SCORE_DATE<dt_a.MonthNext
        INNER JOIN (SELECT--get first row for each ENTITY_ID per month
                        ID,ENTITY_ID,SCORE_DATE,SCORE
                        FROM (SELECT
                                  ID,ENTITY_ID,SCORE_DATE,SCORE
                                      ,ROW_NUMBER() OVER(PARTITION BY ENTITY_ID,DATEADD(month,DATEDIFF(month,0,SCORE_DATE),0) ORDER BY ENTITY_ID,SCORE_DATE) AS RowRank
                                  FROM @YourTable
                             ) dt
                        WHERE dt.RowRank=1
                   ) dt_f ON y.ID=dt_f.ID

DELETE @YourTable
    WHERE ID NOT IN (SELECT ID FROM @YourTable2)


SELECT 'AFTER ',* FROM @YourTable ORDER BY ENTITY_ID,SCORE_DATE

OUTPUT:

       ID          ENTITY_ID SCORE_DATE              SCORE       SOME_OTHER_DATA
------ ----------- --------- ----------------------- ----------- ----------------------------------------------------------------------------------------------------
BEFORE 1           A00000001 2010-01-01 00:00:00.000 100         some data
BEFORE 9           A00000001 2010-01-02 00:00:00.000 104         NULL
BEFORE 5           A00000001 2010-02-01 00:00:00.000 104         NULL
BEFORE 2           A00000002 2010-01-01 00:00:00.000 105         more data
BEFORE 10          A00000002 2010-01-02 00:00:00.000 119         NULL
BEFORE 6           A00000002 2010-02-01 00:00:00.000 119         NULL
BEFORE 3           A00000003 2010-01-01 00:00:00.000 104         various text
BEFORE 11          A00000003 2010-01-01 00:00:00.000 119         NULL
BEFORE 7           A00000003 2010-02-01 00:00:00.000 119         NULL
BEFORE 4           A00009999 2010-01-01 00:00:00.000 101         NULL
BEFORE 8           A00009999 2010-02-01 00:00:00.000 101         arbitrary data

(11 row(s) affected)

(8 row(s) affected)

(3 row(s) affected)

       ID          ENTITY_ID SCORE_DATE              SCORE       SOME_OTHER_DATA
------ ----------- --------- ----------------------- ----------- ----------------------------------------------------------------------------------------------------
AFTER  1           A00000001 2010-01-01 00:00:00.000 102         some data
AFTER  5           A00000001 2010-02-01 00:00:00.000 104         NULL
AFTER  2           A00000002 2010-01-01 00:00:00.000 112         more data
AFTER  6           A00000002 2010-02-01 00:00:00.000 119         NULL
AFTER  3           A00000003 2010-01-01 00:00:00.000 111         various text
AFTER  7           A00000003 2010-02-01 00:00:00.000 119         NULL
AFTER  4           A00009999 2010-01-01 00:00:00.000 101         NULL
AFTER  8           A00009999 2010-02-01 00:00:00.000 101         arbitrary data

(8 row(s) affected)
KM
A: 

This will give you averages for all of your data:

select ENTITY_ID, year(SCORE_DATE) as Year, month(SCORE_DATE) as Month, avg(SCORE) as Avg
from MyTable
group by ENTITY_ID, year(SCORE_DATE), month(SCORE_DATE)

To restrict to a given month, e.g., last February, you can do:

select ENTITY_ID, year(SCORE_DATE) as Year, month(SCORE_DATE) as Month, avg(SCORE) as Avg
from MyTable
where year(SCORE_DATE) = 2010 and month(SCORE_DATE) = 2
group by ENTITY_ID, year(SCORE_DATE), month(SCORE_DATE)

This version would actually perform better, but the parameters are a little less friendly to deal with:

select ENTITY_ID, year(SCORE_DATE) as Year, month(SCORE_DATE) as Month, avg(SCORE) as Avg
from MyTable
where SCORE_DATE >= '2/1/2010' and SCORE_DATE < '3/1/2010'
group by ENTITY_ID, year(SCORE_DATE), month(SCORE_DATE)

If you want a query that always returns last month's data, you can do this:

select ENTITY_ID, year(SCORE_DATE) as Year, month(SCORE_DATE) as Month, avg(SCORE) as Avg
from MyTable
where year(SCORE_DATE) = year(dateadd(month, -1, getdate())) and month(dateadd(month, -1, getdate())) = 2
group by ENTITY_ID, year(SCORE_DATE), month(SCORE_DATE)

A better-performing version:

select ENTITY_ID, year(SCORE_DATE) as Year, month(SCORE_DATE) as Month, avg(SCORE) as Avg
from MyTable
where SCORE_DATE >= dateadd(month, ((year(getdate()) - 1900) * 12) + month(getdate())-2, 0) 
    and SCORE_DATE < dateadd(month, ((year(getdate()) - 1900) * 12) + month(getdate())-1, 0)
group by ENTITY_ID, year(SCORE_DATE), month(SCORE_DATE)
RedFilter