views:

172

answers:

5

I need a query for SQl server 2005 (SQL server management studio express). I have data stored as 1 minute time frame (1 minute each row), for each table columns are ID, Symbol, DateTime, Open, High, Low, Close, Volume. I need to convert (compress) to every possibile multiple time frame, so let's say 10 minutes, 13, 15, and so on. Provide full details if somebody could help. Thanks Alberto

+3  A: 

Alberto, it looks like you need a "Group By" clause in SQL statements (as Leppie stated). So, you should better look for it.

First you should filter the rows that is subject for aggregation by using begin and end date/time and then group them by the mentioned clause.

Here is the first link when i search "sql group by" keywords via Google.

orka
+1  A: 

Not simple "Group By" - Open and Close values need taken for first and correspondingly last row in group. Or at least so is it for Forex data :)

Arvo
Yes correct, it is not a simple group by.
Alberto acepsut
+1  A: 
;WITH cte AS
(SELECT *,
        (32 * CAST([DATETIME] AS INT)) + DATEPART(HOUR,[DATETIME]) + (DATEPART(MINUTE,[DATETIME])/15)/4.0 AS Seg
     FROM     prices
     )
,cte1 AS
(
SELECT *,
        ROW_NUMBER() OVER (PARTITION BY Symbol,Seg ORDER BY [DATETIME])      AS RN_ASC ,
        ROW_NUMBER() OVER (PARTITION BY Symbol,Seg ORDER BY [DATETIME] DESC) AS RN_DESC
FROM cte
)     
SELECT 
      Symbol,
      Seg,
      MAX(CASE WHEN RN_ASC=1 THEN [DATETIME] END) AS OpenDateTime,
      MAX(CASE WHEN RN_ASC=1 THEN [OPEN] END) AS [OPEN],
      MAX(High) High,
      MIN(Low)  Low,
      SUM(Volume) Volume,
      MAX(CASE WHEN RN_DESC=1 THEN [CLOSE] END) AS [CLOSE],
      MAX(CASE WHEN RN_DESC=1 THEN [DATETIME] END) AS CloseDateTime
FROM cte1
GROUP BY Symbol,Seg
ORDER BY OpenDateTime

Or another approach that may be worth testing to see if it is any faster.

DECLARE @D1 DATETIME
DECLARE @D2 DATETIME
DECLARE @Interval FLOAT

SET @D1  = '2010-10-18 09:00:00.000'
SET @D2  = '2010-10-19 18:00:00.000'
SET @Interval = 15

;WITH 
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 A CROSS JOIN L0 B),
L2 AS (SELECT 1 AS c FROM L1 A CROSS JOIN L1 B),
L3 AS (SELECT 1 AS c FROM L2 A CROSS JOIN L2 B),
L4 AS (SELECT 1 AS c FROM L3 A CROSS JOIN L3 B),
Nums AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS i FROM L4),
Ranges AS(
SELECT 
      DATEADD(MINUTE,@Interval*(i-1),@D1) AS StartRange,
      DATEADD(MINUTE,@Interval*i,@D1) AS NextRange
FROM Nums where i <= 1+CEILING(DATEDIFF(MINUTE,@D1,@D2)/@Interval))
,cte AS (
SELECT 
     * 
     ,ROW_NUMBER() OVER (PARTITION BY Symbol,r.StartRange ORDER BY [DateTime])      AS RN_ASC 
     ,ROW_NUMBER() OVER (PARTITION BY Symbol,r.StartRange ORDER BY [DateTime] DESC) AS RN_DESC
FROM Ranges r
JOIN prices p ON p.[DateTime] >= r.StartRange and p.[DateTime] < r.NextRange )
SELECT 
      Symbol,
      MAX(CASE WHEN RN_ASC=1 THEN [DateTime] END) AS OpenDateTime,
      MAX(CASE WHEN RN_ASC=1 THEN [Open] END) AS [Open],
      MAX(High) High,
      MIN(Low)  Low,
      SUM(Volume) Volume,
      MAX(CASE WHEN RN_DESC=1 THEN [Close] END) AS [Close],
      MAX(CASE WHEN RN_DESC=1 THEN [DateTime] END) AS CloseDateTime
FROM cte
GROUP BY Symbol,StartRange
ORDER BY OpenDateTime
Martin Smith
Thanks Martin, but I get error: maybe I can provide you some sample xls data from my SQL db, I can imagine it is very hard to code this query properly without data. Can I attach file here? Otherwise acepsut is my Skype nick as well as my gmail.com account
Alberto acepsut
What error do you get? (If you want to put data up somewhere maybe Google spreadsheets would be a good place?)
Martin Smith
@Alberto - This is updated following comments beneath smirkingman's answer.
Martin Smith
Hi Martin,your edited version is working but skips some rows when used to aggregate all data in table http://img404.imageshack.us/img404/5096/f996.png2010-10-19 miss all data from 9:00 up to 16:44
Alberto acepsut
@Alberto What does the raw data look like for that symbol `WHERE [DateTime] >= '2010-10-18T17:30:00.000' AND [DateTime] < '2010-10-19T16:45:00.000'`?
Martin Smith
@Martin: http://img684.imageshack.us/img684/1379/f997.png and http://img440.imageshack.us/img440/1862/f998.pngand so on up to 17:30:00
Alberto acepsut
@Alberto - Ah, Find the three places with `CAST([DateTime] AS INT)` and replace with `(32 * CAST([DateTime] AS INT))`
Martin Smith
Thank you very much Martin now your query works perfectly and very fast, only 8 seconds to aggregate 237,000+rows. Thanks all for every help and support
Alberto acepsut
@Alberto- No worries. You'll probably find that you can speed this up if you make `seg` a computed column (`ALTER TABLE prices ADD Seg AS (32 * CAST([DateTime] AS INT)) + DATEPART(HOUR,[DateTime]) + (DATEPART(MINUTE,[DateTime])/15)/4.0`) then create a composite index on `Symbol,Seg ,datetime` columns. (`CREATE UNIQUE NONCLUSTERED INDEX IX_name ON prices(Symbol,Seg,DateTime)`)
Martin Smith
@Martin: your last query returns these errors:Message 2715 level 16 row 40 - Column,parameter or variable#1:impossible to find kind of data 'DateTime'.Invalid kind of data of this parameter or variable'@D1'.The same for '@D2'
Alberto acepsut
@Martin: your first (just edited) query now skips some rows
Alberto acepsut
@Alberto - You'll need to use ISO date format as before. You need to put some effort into solving these issues yourself! `SET @D1 = '2010-10-18T09:00:00.000' SET @D2 = '2010-10-19T18:00:00.000'`
Martin Smith
+1  A: 

Would be prettier with a stored proc to extract MIN(datetime) first, but here's a sketch:

WITH quarters(q) AS (
    SELECT DISTINCT
        15*CAST(DATEDIFF("n",'2000/01/01',dataora) / 15 as Int) AS primo
    FROM 
        Prezzi
)
SELECT
    simbolo, DATEADD("n",q,'2000/01/01') AS tick, 
        MIN(minimo) AS minimo, MAX(massimo) AS massimo,
        (SELECT 
            TOP 1 apertura FROM Prezzi P 
         WHERE 
            P.simbolo = simbolo AND 
            P.dataora >= DATEADD("n",q,'2000/01/01')
         ORDER BY
            P.dataora ASC
         ) as primaapertura,
        (SELECT 
            TOP 1 chiusura FROM Prezzi P 
         WHERE 
            P.simbolo = simbolo AND 
            P.dataora < DATEADD("s",14*60+59,DATEADD("n",q,'2000/01/01'))
         ORDER BY
            P.dataora DESC
         ) as ultimachiusara,
        SUM(volume) / COUNT(*) AS volumemedio
FROM
    quarters INNER JOIN Prezzi
    ON dataora BETWEEN DATEADD("n",q,'2000/01/01')
        AND DATEADD("s",14*60+59,DATEADD("n",q,'2000/01/01'))
GROUP BY
    simbolo, DATEADD("n",q,'2000/01/01')
ORDER BY 
    1, 2

The WITH clause gets a list of 15 minute intervals, rounded down, in your dataset (let's assume nothing before 2000). Then use those intervals to group by 14:59 interval. For the volume, you'll have to decide if you want average or the total.

The syntax might be a tad off, but you should get the idea.

EDIT: Adjusted MIN(open), MIN(close) to pick up FIRST and LAST. In reality this won't change much, as the concept of Open and Close depend on knowing the time difference between the exchange where the quote originated and the clock of the computer collecting the data.

In addition, unless the OP has the privilege of a real-time feed from all the exchanges, all the quotes are delayed by 20 minutes anyway.

EDIT(2): Quite right, FIRST and LAST are carry-overs from my IBM days >;-)

Solution now selects first and last quotes during the interval using TOP with ASC/DESC.

smirkingman
Why have you got `MIN(open)` and `MIN(close)`? The OP needs the `open` price related to the first record per segment and the `close` price related to the last record per segment.
Martin Smith
Laziness. His data is coming from a price feed. Open and close cannot change during the day, they're the stock price first thing this morning and last thing last night.
smirkingman
That does indeed make sense but that's not how the OP defines it in the comments. "Close(last price within this time interval) " @Alberto - Can you clarify?
Martin Smith
Hi all,I uploaded a picture to show how database looks likehttp://img706.imageshack.us/img706/2267/f987.png Some explains:exchanges starts at 9:00:00 and there are many trades from 9:00:00 and 9:00:59.All these exchanges are called "ticks".Apertura (Open) is first price traded as soon as market open, Chiusura (Close) is last traded price, Massimo (High) and Minimo (Low) are max and min price traded within this time frame.What you see in this picture are ticks aggregation,or compression,and are stored in 1 min time frame. I need a query to get the same for every other 1 min multiple t.frame
Alberto acepsut
So for a 15 min data aggregation,I should have 1 row each 15 1 minute rows,where Open will be the 9:00:00 Open value,Close will be 9:14:00 row close value,High the highest value from all 15 1 min rows High column,Low the lowest from all 15 1 min rows Low column and for Volume the sum of these 15 1 minute rows.This for all the whole stock,so first row will be 9:00:00, then 9:15:00, 9:30:00 and so on up to Time end (this market close at 17:30:00).I need also to set Time start and Time end since USA markets trades in different times
Alberto acepsut
Problems arise when some initial rows are missing.Let's suppose 9:00:00,9:01:00 and 9:02:00 are missing.Data starts in DataBase with 9:03:00. If I set Time start as 9:00:00 and 15 min. aggregation the query should get data from 9:03:00 up to 9:14:00, so it should always end considering 9:14:00 as last row,then print 9:15:00,9:30:00 and so on, no matter if 1 minute data starts with 9:00:00 or 9:03:00
Alberto acepsut
@smirkingman - There is no `FIRST` or `LAST` aggregate in SQL Server. @Alberto - Having looked at your data I don't see why my answer wouldn't work. Can you explain the error you mentioned?
Martin Smith
This is the error I getMessaggio 242, livello 16, stato 3, riga 2La conversione del tipo di dati da char a datetime ha generato un valore di tipo datetime non compreso nell'intervallo dei valori consentiti.TranslateMessage 242 level 16 row 2Data type conversion from char to datetime has generated a datetime kind of value not within the range of allowed values
Alberto acepsut
@Alberto - Are you talking about my answer? If so probably as I didn't use ISO date format. Try `WHERE [DateTime] >= '2010-10-18T09:00:00.000' AND [DateTime] < '2010-10-18T09:15:00.000'`
Martin Smith
Thanks Martin now it works correctly!Result is shown here http://img3.imageshack.us/img3/9532/f989.png Now is it possibile to do this for all data? Not just a single row but turn to 15 min tf all data for this table.Also,can you avoid to print last column? (ChiusuraDataOra)
Alberto acepsut
Guys: It will be easier to follow the discussion if your write comments under the solution to which they pertain
smirkingman
@Alberto: I have corrected my answer, which addresses your previous remark (all rows).
smirkingman
@smirkingman:Thank you very much,I will check immediately after lunch.Back soon!
Alberto acepsut
@smirkingman:it seems working correctly but it takes long times to complete.It is possible to speed up the whole process? Thank you
Alberto acepsut
DataBase name is StockDataFromSella
Alberto acepsut
The reason it's slow is probably due to the 2 clauses that calculate PrimaApertura and UltimaChiusara. As the Open/Close doesn't change during the day, would it not be simpler to remove them or add them afterwards? Try it without them and tell me if it's faster?
smirkingman
I run query without PrimaApertura and UltimaChiusura as in this picture here http://img177.imageshack.us/img177/321/f992.pngIt still takes a log time to execute the query, 18 minutes and 8 seconds as show in the lower right part of the picture
Alberto acepsut
You didn't remove them. the SELECT part should be "SELECT simbolo, DATEADD("n",q,'2000/01/01') AS tick, MIN(minimo) AS minimo, MAX(massimo) AS massimo, SUM(volume) / COUNT(*) AS volumemedio FROM" .... How many records are there in the input?
smirkingman
There are 237128 1 minutes rows. Now I have removed as from your last suggestion,running time is now only 3 seconds!http://img836.imageshack.us/img836/705/f993.png
Alberto acepsut
alora, tutto bene? >;-)
smirkingman
Ohh smirkingman,like to see you speak italian,it is easier for me!Your query is very good,but I need to have Open as well Close as faster as possibile. If I remove these columns the query execute at the speed of light, but how to get a quick query with all columns?
Alberto acepsut
I only have a smattering of Italian, it's a national language here in Switzerland. I suggest you use the fast query and then add in the Open/Close columns afterwards, as they don't change during the day.
smirkingman
No hope to have a quicker fully automated query to do it all without having to add Open and Close afterwards?Thank you very much indeed
Alberto acepsut
Come now, we've collectively spent quite a bit of time helping you to get 95% of your solution. It would seem reasonable that you finish the remaining 5% yourself.
smirkingman
You are right, thanks for all!
Alberto acepsut
It is considered gracious to accept one of the answers.
smirkingman
Yes smirkingman, I will accept one of the answer, I was just waiting for any new reply or edit
Alberto acepsut
A: 
    Declare @tbl1MinENI Table 
    (ID int identity,
     Simbolo char(3),
     DataOra datetime,
     Apertura numeric(15,4),
     Massimo  numeric(15,4),
     Minimo numeric(15,4),
     Chiusura numeric(15,4),
     Volume int)

    Insert Into  @tbl1MinENI (  Simbolo, DataOra, Apertura, Massimo, Minimo, Chiusura, Volume)
    Values
    ('ENI', '2010/10/18 09:00:00', 16.1100, 16.1800, 16.1100, 16.1400, 244015),
    ('ENI', '2010/10/18 09:01:00', 16.1400, 16.1400, 16.1300, 16.1400, 15692 ),
    ('ENI', '2010/10/18 09:02:00', 16.1400, 16.1500, 16.1400, 16.1500, 147035),
    ('ENI', '2010/10/18 09:03:00', 16.1500, 16.1600, 16.1500, 16.1600, 5181  ),
    ('ENI', '2010/10/18 09:04:00', 16.1600, 16.2000, 16.1600, 16.1900, 5134  ),
    ('ENI', '2010/10/18 09:05:00', 16.1900, 16.1900, 16.1800, 16.1800, 15040 ),
    ('ENI', '2010/10/18 09:06:00', 16.1900, 16.1900, 16.1600, 16.1600, 68867 ),
    ('ENI', '2010/10/18 09:07:00', 16.1600, 16.1600, 16.1600, 16.1600, 7606  ),
    ('ENI', '2010/10/18 09:08:00', 16.1500, 16.1500, 16.1500, 16.1500, 725   ),
    ('ENI', '2010/10/18 09:09:00', 16.1600, 16.1600, 16.1600, 16.1600, 81    ),
    ('ENI', '2010/10/18 09:10:00', 16.1700, 16.1800, 16.1700, 16.1700, 68594 ),
    ('ENI', '2010/10/18 09:11:00', 16.1800, 16.1800, 16.1800, 16.1800, 6619  )

    Declare @nRowsPerGroup int = 3

;With Prepare as
(
Select datediff(minute, '2010/10/18 09:00:00', DataOra)/@nRowsPerGroup as Grp,
       Row_Number() over (partition by datediff(minute, '2010/10/18 09:00:00', DataOra)/@nRowsPerGroup order by dataora) as rn,
       *
  From tbl1MinENI     
), b as
(
Select a.Grp, 
         Min(a.DataOra)          as GroupDataOra, 
         Min(ID) AperturaID,
         max(a.Massimo)          as Massimo, 
         Min(a.Minimo)           as Minimo, 
         max(id) ChiusuraID,
         sum(a.Volume)           as Volume
    From Prepare a
   Group by Grp
)
Select b.grp,
       b.GroupDataOra,
       ta.Apertura,
       b.Massimo,
       b.Minimo,
       tc.Chiusura,
       b.Volume
 From b
 Inner Join tbl1MinENI ta on ta.ID=b.AperturaID
 Inner Join tbl1MinENI tc on tc.ID=b.ChiusuraID
 ;   
Niikola
Thanks Nikola,I get some errors 1) Message 102 level 15 row 13 Incorrect sintax near ',' 2) Message 139 level 15 row impossibile to allocate a pre-defined value to a local variable 3) Message 137 level 15 row 30 Declare the scalar value "@nRowsPerGroup".
Alberto acepsut
This won't work if the record for 09:00:00 is missing (see earlier comments)
smirkingman
It works. Just remove insert for the 09:00:00 and try. In fact you can remove any number of rows you want it will still work.
Niikola