I have tables of data samples, each row consisting of a timestamp and some data. Each table has a clustered index on the timestamp followed by a data-specific key. The samples are not necessarily equally spaced in time.
I need to downsample the data in a particular time range in order to draw graphs - say, going from 100,000 rows to N, where N is about 50. While I may have to compromise on the "correctness" of the algorithm from a DSP point of view, I'd like to keep this in SQL for performance reasons.
My current idea is to group the samples in the time range into N boxes, and then take the average of each box. One way to achieve this in SQL is to apply a partition function that maps each date to a value from 0 to N-1 (inclusive), and then GROUP BY that value and take the AVG.
I think this GROUP BY can be performed without a sort, because the dates come from a clustered index and the partition function is monotone. However, SQL Server doesn't seem to notice this, and it issues a sort that accounts for 78% of the execution cost (in the example below). Assuming I'm right and the sort really is unnecessary, eliminating it would make the query roughly five times faster.
Is there any way to force SQL Server to skip the sort? Or is there a better way to approach the problem?
Cheers, Ben
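-- demo setup: a small test table with a clustered primary key on (date, v)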
IF EXISTS(SELECT name FROM sysobjects WHERE name = N'test') DROP TABLE test
CREATE TABLE test
(
    date DATETIME NOT NULL,
    v FLOAT NOT NULL,
    CONSTRAINT PK_test PRIMARY KEY CLUSTERED (date ASC, v ASC)
)
INSERT INTO test (date, v) VALUES ('2009-08-22 14:06:00.000', 1)
INSERT INTO test (date, v) VALUES ('2009-08-22 17:09:00.000', 8)
INSERT INTO test (date, v) VALUES ('2009-08-24 00:00:00.000', 2)
INSERT INTO test (date, v) VALUES ('2009-08-24 03:00:00.000', 9)
INSERT INTO test (date, v) VALUES ('2009-08-24 14:06:00.000', 7)
-- the lower bound is set to the table min for demo purposes; in reality
-- it could be any date
declare @min float
set @min = cast((select min(date) from test) as float)
-- similarly for max
declare @max float
set @max = cast((select max(date) from test) as float)
-- the number of results to return (assuming enough data is available)
declare @count int
set @count = 3
-- precompute scale factor
declare @scale float
set @scale = (@count - 1) / (@max - @min)
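-- sanity-check the computed scale factor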
select @scale
-- this scales the dates from 0 to n-1
select (cast(date as float) - @min) * @scale, v from test
-- this rounds the scaled dates to the nearest partition,
-- groups by the partition, and then averages values in each partition
select round((cast(date as float) - @min) * @scale, 0), avg(v) from test
group by round((cast(date as float) - @min) * @scale, 0)
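With the sample data above and @count = 3, the scaled dates come out to 0, ~0.13, ~1.41, ~1.54 and 2, so the rounded partitions are 0, 0, 1, 2, 2 and the final query should return (0, 4.5), (1, 2) and (2, 8).

For reference, here is an equivalent formulation that computes the partition once in a derived table so the ROUND expression isn't repeated; I haven't checked whether it changes the plan (the derived-table alias and column name are just illustrative):

-- same grouping, but the partition expression is computed once in a derived table
select bucket, avg(v)
from
(
    select round((cast(date as float) - @min) * @scale, 0) as bucket, v
    from test
) t
group by bucket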