ansaurus

Question

Data Aggregation in SQL Server 2005

Answer 1

+3 A:

Alberto, it looks like you need a "Group By" clause in SQL statements (as Leppie stated). So, you should better look for it.

First you should filter the rows that is subject for aggregation by using begin and end date/time and then group them by the mentioned clause.

Here is the first link when i search "sql group by" keywords via Google.

orka 2010-10-18 12:47:30

Answer 2

+1 A:

Not simple "Group By" - Open and Close values need taken for first and correspondingly last row in group. Or at least so is it for Forex data :)

Arvo 2010-10-18 13:02:38

Yes correct, it is not a simple group by.

Alberto acepsut 2010-10-18 13:28:42

Answer 3

+1 A:

;WITH cte AS
(SELECT *,
        (32 * CAST([DATETIME] AS INT)) + DATEPART(HOUR,[DATETIME]) + (DATEPART(MINUTE,[DATETIME])/15)/4.0 AS Seg
     FROM     prices
     )
,cte1 AS
(
SELECT *,
        ROW_NUMBER() OVER (PARTITION BY Symbol,Seg ORDER BY [DATETIME])      AS RN_ASC ,
        ROW_NUMBER() OVER (PARTITION BY Symbol,Seg ORDER BY [DATETIME] DESC) AS RN_DESC
FROM cte
)     
SELECT 
      Symbol,
      Seg,
      MAX(CASE WHEN RN_ASC=1 THEN [DATETIME] END) AS OpenDateTime,
      MAX(CASE WHEN RN_ASC=1 THEN [OPEN] END) AS [OPEN],
      MAX(High) High,
      MIN(Low)  Low,
      SUM(Volume) Volume,
      MAX(CASE WHEN RN_DESC=1 THEN [CLOSE] END) AS [CLOSE],
      MAX(CASE WHEN RN_DESC=1 THEN [DATETIME] END) AS CloseDateTime
FROM cte1
GROUP BY Symbol,Seg
ORDER BY OpenDateTime

Or another approach that may be worth testing to see if it is any faster.

DECLARE @D1 DATETIME
DECLARE @D2 DATETIME
DECLARE @Interval FLOAT

SET @D1  = '2010-10-18 09:00:00.000'
SET @D2  = '2010-10-19 18:00:00.000'
SET @Interval = 15

;WITH 
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0 B),
L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1 B),
L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2 B),
L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3 B),
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS i FROM L4),
Ranges AS(
SELECT 
      DATEADD(MINUTE,@Interval*(i-1),@D1) AS StartRange,
      DATEADD(MINUTE,@Interval*i,@D1) AS NextRange
FROM Nums where i <= 1+CEILING(DATEDIFF(MINUTE,@D1,@D2)/@Interval))
,cte AS (
SELECT 
     * 
     ,ROW_NUMBER() OVER (PARTITION BY Symbol,r.StartRange ORDER BY [DateTime])      AS RN_ASC 
     ,ROW_NUMBER() OVER (PARTITION BY Symbol,r.StartRange ORDER BY [DateTime] DESC) AS RN_DESC
FROM Ranges r
JOIN prices p ON p.[DateTime] >= r.StartRange and p.[DateTime] < r.NextRange )
SELECT 
      Symbol,
      MAX(CASE WHEN RN_ASC=1 THEN [DateTime] END) AS OpenDateTime,
      MAX(CASE WHEN RN_ASC=1 THEN [Open] END) AS [Open],
      MAX(High) High,
      MIN(Low)  Low,
      SUM(Volume) Volume,
      MAX(CASE WHEN RN_DESC=1 THEN [Close] END) AS [Close],
      MAX(CASE WHEN RN_DESC=1 THEN [DateTime] END) AS CloseDateTime
FROM cte
GROUP BY Symbol,StartRange
ORDER BY OpenDateTime

Martin Smith 2010-10-18 13:37:09

Thanks Martin, but I get error: maybe I can provide you some sample xls data from my SQL db, I can imagine it is very hard to code this query properly without data. Can I attach file here? Otherwise acepsut is my Skype nick as well as my gmail.com account

Alberto acepsut 2010-10-18 14:55:55

What error do you get? (If you want to put data up somewhere maybe Google spreadsheets would be a good place?)

Martin Smith 2010-10-18 14:58:23

@Alberto - This is updated following comments beneath smirkingman's answer.

Martin Smith 2010-10-19 11:28:56

Hi Martin,your edited version is working but skips some rows when used to aggregate all data in table http://img404.imageshack.us/img404/5096/f996.png2010-10-19 miss all data from 9:00 up to 16:44

Alberto acepsut 2010-10-20 08:00:22

@Alberto What does the raw data look like for that symbol `WHERE [DateTime] >= '2010-10-18T17:30:00.000' AND [DateTime] < '2010-10-19T16:45:00.000'`?

Martin Smith 2010-10-20 09:13:22

@Martin: http://img684.imageshack.us/img684/1379/f997.png and http://img440.imageshack.us/img440/1862/f998.pngand so on up to 17:30:00

Alberto acepsut 2010-10-20 09:29:44

@Alberto - Ah, Find the three places with `CAST([DateTime] AS INT)` and replace with `(32 * CAST([DateTime] AS INT))`

Martin Smith 2010-10-20 09:38:49

Thank you very much Martin now your query works perfectly and very fast, only 8 seconds to aggregate 237,000+rows. Thanks all for every help and support

Alberto acepsut 2010-10-20 09:52:04

@Alberto- No worries. You'll probably find that you can speed this up if you make `seg` a computed column (`ALTER TABLE prices ADD Seg AS (32 * CAST([DateTime] AS INT)) + DATEPART(HOUR,[DateTime]) + (DATEPART(MINUTE,[DateTime])/15)/4.0`) then create a composite index on `Symbol,Seg ,datetime` columns. (`CREATE UNIQUE NONCLUSTERED INDEX IX_name ON prices(Symbol,Seg,DateTime)`)

Martin Smith 2010-10-20 09:56:45

@Martin: your last query returns these errors:Message 2715 level 16 row 40 - Column,parameter or variable#1:impossible to find kind of data 'DateTime'.Invalid kind of data of this parameter or variable'@D1'.The same for '@D2'

Alberto acepsut 2010-10-20 11:53:19

@Martin: your first (just edited) query now skips some rows

Alberto acepsut 2010-10-20 11:55:13

@Alberto - You'll need to use ISO date format as before. You need to put some effort into solving these issues yourself! `SET @D1 = '2010-10-18T09:00:00.000' SET @D2 = '2010-10-19T18:00:00.000'`

Martin Smith 2010-10-20 12:27:05

Answer 4

+1 A:

Would be prettier with a stored proc to extract MIN(datetime) first, but here's a sketch:

WITH quarters(q) AS (
    SELECT DISTINCT
        15*CAST(DATEDIFF("n",'2000/01/01',dataora) / 15 as Int) AS primo
    FROM 
        Prezzi
)
SELECT
    simbolo, DATEADD("n",q,'2000/01/01') AS tick, 
        MIN(minimo) AS minimo, MAX(massimo) AS massimo,
        (SELECT 
            TOP 1 apertura FROM Prezzi P 
         WHERE 
            P.simbolo = simbolo AND 
            P.dataora >= DATEADD("n",q,'2000/01/01')
         ORDER BY
            P.dataora ASC
         ) as primaapertura,
        (SELECT 
            TOP 1 chiusura FROM Prezzi P 
         WHERE 
            P.simbolo = simbolo AND 
            P.dataora < DATEADD("s",14*60+59,DATEADD("n",q,'2000/01/01'))
         ORDER BY
            P.dataora DESC
         ) as ultimachiusara,
        SUM(volume) / COUNT(*) AS volumemedio
FROM
    quarters INNER JOIN Prezzi
    ON dataora BETWEEN DATEADD("n",q,'2000/01/01')
        AND DATEADD("s",14*60+59,DATEADD("n",q,'2000/01/01'))
GROUP BY
    simbolo, DATEADD("n",q,'2000/01/01')
ORDER BY 
    1, 2

The WITH clause gets a list of 15 minute intervals, rounded down, in your dataset (let's assume nothing before 2000). Then use those intervals to group by 14:59 interval. For the volume, you'll have to decide if you want average or the total.

The syntax might be a tad off, but you should get the idea.

EDIT: Adjusted MIN(open), MIN(close) to pick up FIRST and LAST. In reality this won't change much, as the concept of Open and Close depend on knowing the time difference between the exchange where the quote originated and the clock of the computer collecting the data.

In addition, unless the OP has the privilege of a real-time feed from all the exchanges, all the quotes are delayed by 20 minutes anyway.

EDIT(2): Quite right, FIRST and LAST are carry-overs from my IBM days >;-)

Solution now selects first and last quotes during the interval using TOP with ASC/DESC.

smirkingman 2010-10-18 15:18:28

Why have you got `MIN(open)` and `MIN(close)`? The OP needs the `open` price related to the first record per segment and the `close` price related to the last record per segment.

Martin Smith 2010-10-18 15:29:37

Laziness. His data is coming from a price feed. Open and close cannot change during the day, they're the stock price first thing this morning and last thing last night.

smirkingman 2010-10-18 15:40:53

That does indeed make sense but that's not how the OP defines it in the comments. "Close(last price within this time interval) " @Alberto - Can you clarify?

Martin Smith 2010-10-18 15:51:07

Hi all,I uploaded a picture to show how database looks likehttp://img706.imageshack.us/img706/2267/f987.png Some explains:exchanges starts at 9:00:00 and there are many trades from 9:00:00 and 9:00:59.All these exchanges are called "ticks".Apertura (Open) is first price traded as soon as market open, Chiusura (Close) is last traded price, Massimo (High) and Minimo (Low) are max and min price traded within this time frame.What you see in this picture are ticks aggregation,or compression,and are stored in 1 min time frame. I need a query to get the same for every other 1 min multiple t.frame

Alberto acepsut 2010-10-19 08:32:55

So for a 15 min data aggregation,I should have 1 row each 15 1 minute rows,where Open will be the 9:00:00 Open value,Close will be 9:14:00 row close value,High the highest value from all 15 1 min rows High column,Low the lowest from all 15 1 min rows Low column and for Volume the sum of these 15 1 minute rows.This for all the whole stock,so first row will be 9:00:00, then 9:15:00, 9:30:00 and so on up to Time end (this market close at 17:30:00).I need also to set Time start and Time end since USA markets trades in different times

Alberto acepsut 2010-10-19 08:41:42

Problems arise when some initial rows are missing.Let's suppose 9:00:00,9:01:00 and 9:02:00 are missing.Data starts in DataBase with 9:03:00. If I set Time start as 9:00:00 and 15 min. aggregation the query should get data from 9:03:00 up to 9:14:00, so it should always end considering 9:14:00 as last row,then print 9:15:00,9:30:00 and so on, no matter if 1 minute data starts with 9:00:00 or 9:03:00

Alberto acepsut 2010-10-19 08:48:04

@smirkingman - There is no `FIRST` or `LAST` aggregate in SQL Server. @Alberto - Having looked at your data I don't see why my answer wouldn't work. Can you explain the error you mentioned?

Martin Smith 2010-10-19 09:44:38

This is the error I getMessaggio 242, livello 16, stato 3, riga 2La conversione del tipo di dati da char a datetime ha generato un valore di tipo datetime non compreso nell'intervallo dei valori consentiti.TranslateMessage 242 level 16 row 2Data type conversion from char to datetime has generated a datetime kind of value not within the range of allowed values

Alberto acepsut 2010-10-19 10:04:47

@Alberto - Are you talking about my answer? If so probably as I didn't use ISO date format. Try `WHERE [DateTime] >= '2010-10-18T09:00:00.000' AND [DateTime] < '2010-10-18T09:15:00.000'`

Martin Smith 2010-10-19 10:53:07

Thanks Martin now it works correctly!Result is shown here http://img3.imageshack.us/img3/9532/f989.png Now is it possibile to do this for all data? Not just a single row but turn to 15 min tf all data for this table.Also,can you avoid to print last column? (ChiusuraDataOra)

Alberto acepsut 2010-10-19 11:11:34

Guys: It will be easier to follow the discussion if your write comments under the solution to which they pertain

smirkingman 2010-10-19 11:14:07

@Alberto: I have corrected my answer, which addresses your previous remark (all rows).

smirkingman 2010-10-19 11:14:58

@smirkingman:Thank you very much,I will check immediately after lunch.Back soon!

Alberto acepsut 2010-10-19 11:20:55

@smirkingman:it seems working correctly but it takes long times to complete.It is possible to speed up the whole process? Thank you

Alberto acepsut 2010-10-19 12:36:02

DataBase name is StockDataFromSella

Alberto acepsut 2010-10-19 12:40:36

The reason it's slow is probably due to the 2 clauses that calculate PrimaApertura and UltimaChiusara. As the Open/Close doesn't change during the day, would it not be simpler to remove them or add them afterwards? Try it without them and tell me if it's faster?

smirkingman 2010-10-19 13:31:33

I run query without PrimaApertura and UltimaChiusura as in this picture here http://img177.imageshack.us/img177/321/f992.pngIt still takes a log time to execute the query, 18 minutes and 8 seconds as show in the lower right part of the picture

Alberto acepsut 2010-10-19 14:11:06

You didn't remove them. the SELECT part should be "SELECT simbolo, DATEADD("n",q,'2000/01/01') AS tick, MIN(minimo) AS minimo, MAX(massimo) AS massimo, SUM(volume) / COUNT(*) AS volumemedio FROM" .... How many records are there in the input?

smirkingman 2010-10-19 14:22:02

There are 237128 1 minutes rows. Now I have removed as from your last suggestion,running time is now only 3 seconds!http://img836.imageshack.us/img836/705/f993.png

Alberto acepsut 2010-10-19 14:46:26

alora, tutto bene? >;-)

smirkingman 2010-10-19 15:05:50

Ohh smirkingman,like to see you speak italian,it is easier for me!Your query is very good,but I need to have Open as well Close as faster as possibile. If I remove these columns the query execute at the speed of light, but how to get a quick query with all columns?

Alberto acepsut 2010-10-19 15:16:47

I only have a smattering of Italian, it's a national language here in Switzerland. I suggest you use the fast query and then add in the Open/Close columns afterwards, as they don't change during the day.

smirkingman 2010-10-19 15:31:29

No hope to have a quicker fully automated query to do it all without having to add Open and Close afterwards?Thank you very much indeed

Alberto acepsut 2010-10-19 15:36:19

Come now, we've collectively spent quite a bit of time helping you to get 95% of your solution. It would seem reasonable that you finish the remaining 5% yourself.

smirkingman 2010-10-19 15:47:17

You are right, thanks for all!

Alberto acepsut 2010-10-19 15:50:37

It is considered gracious to accept one of the answers.

smirkingman 2010-10-19 20:14:18

Yes smirkingman, I will accept one of the answer, I was just waiting for any new reply or edit

Alberto acepsut 2010-10-20 07:47:38

Answer 5

A:

    Declare @tbl1MinENI Table 
    (ID int identity,
     Simbolo char(3),
     DataOra datetime,
     Apertura numeric(15,4),
     Massimo  numeric(15,4),
     Minimo numeric(15,4),
     Chiusura numeric(15,4),
     Volume int)

    Insert Into  @tbl1MinENI (  Simbolo, DataOra, Apertura, Massimo, Minimo, Chiusura, Volume)
    Values
    ('ENI', '2010/10/18 09:00:00', 16.1100, 16.1800, 16.1100, 16.1400, 244015),
    ('ENI', '2010/10/18 09:01:00', 16.1400, 16.1400, 16.1300, 16.1400, 15692 ),
    ('ENI', '2010/10/18 09:02:00', 16.1400, 16.1500, 16.1400, 16.1500, 147035),
    ('ENI', '2010/10/18 09:03:00', 16.1500, 16.1600, 16.1500, 16.1600, 5181  ),
    ('ENI', '2010/10/18 09:04:00', 16.1600, 16.2000, 16.1600, 16.1900, 5134  ),
    ('ENI', '2010/10/18 09:05:00', 16.1900, 16.1900, 16.1800, 16.1800, 15040 ),
    ('ENI', '2010/10/18 09:06:00', 16.1900, 16.1900, 16.1600, 16.1600, 68867 ),
    ('ENI', '2010/10/18 09:07:00', 16.1600, 16.1600, 16.1600, 16.1600, 7606  ),
    ('ENI', '2010/10/18 09:08:00', 16.1500, 16.1500, 16.1500, 16.1500, 725   ),
    ('ENI', '2010/10/18 09:09:00', 16.1600, 16.1600, 16.1600, 16.1600, 81    ),
    ('ENI', '2010/10/18 09:10:00', 16.1700, 16.1800, 16.1700, 16.1700, 68594 ),
    ('ENI', '2010/10/18 09:11:00', 16.1800, 16.1800, 16.1800, 16.1800, 6619  )

    Declare @nRowsPerGroup int = 3

;With Prepare as
(
Select datediff(minute, '2010/10/18 09:00:00', DataOra)/@nRowsPerGroup as Grp,
       Row_Number() over (partition by datediff(minute, '2010/10/18 09:00:00', DataOra)/@nRowsPerGroup order by dataora) as rn,
       *
  From tbl1MinENI     
), b as
(
Select a.Grp, 
         Min(a.DataOra)          as GroupDataOra, 
         Min(ID) AperturaID,
         max(a.Massimo)          as Massimo, 
         Min(a.Minimo)           as Minimo, 
         max(id) ChiusuraID,
         sum(a.Volume)           as Volume
    From Prepare a
   Group by Grp
)
Select b.grp,
       b.GroupDataOra,
       ta.Apertura,
       b.Massimo,
       b.Minimo,
       tc.Chiusura,
       b.Volume
 From b
 Inner Join tbl1MinENI ta on ta.ID=b.AperturaID
 Inner Join tbl1MinENI tc on tc.ID=b.ChiusuraID
 ;

Niikola 2010-10-19 12:27:44

Thanks Nikola,I get some errors 1) Message 102 level 15 row 13 Incorrect sintax near ',' 2) Message 139 level 15 row impossibile to allocate a pre-defined value to a local variable 3) Message 137 level 15 row 30 Declare the scalar value "@nRowsPerGroup".

Alberto acepsut 2010-10-19 12:39:41

This won't work if the record for 09:00:00 is missing (see earlier comments)

smirkingman 2010-10-19 13:33:58

It works. Just remove insert for the 09:00:00 and try. In fact you can remove any number of rows you want it will still work.

Niikola 2010-10-26 16:45:17

ansaurus

tags:

views:

answers:

Data Aggregation in SQL Server 2005

related questions