I have tables of data samples, with a timestamp and some data. Each table has a clustered index on the timestamp, and then a data-specific key. Data samples are not necessarily equidistant.

I need to downsample the data in a particular time range in order to draw graphs - say, going from 100,000 rows to N, where N is about 50. While I may have to compromise on the "correctness" of the algorithm from a DSP point of view, I'd like to keep this in SQL for performance reasons.

My current idea is to group samples in the time range into N boxes, and then take the average of each group. One way to achieve this in SQL is to apply to the date a partition function whose output ranges from 0 to N-1 (inclusive), and then GROUP BY and AVG.

I think that this GROUP BY can be performed without a sort, because the date is from a clustered index, and the partition function is monotone. However, SQL Server doesn't seem to notice this, and it issues a sort that represents 78% of the execution cost (in the example below). Assuming I'm right, and this sort is unnecessary, I could make the query 5 times faster.

Is there any way to force SQL Server to skip the sort? Or is there a better way to approach the problem?

Cheers, Ben

IF EXISTS(SELECT name FROM sysobjects WHERE name = N'test') DROP TABLE test

CREATE TABLE test
(
  date DATETIME NOT NULL,
  v FLOAT NOT NULL,
  CONSTRAINT PK_test PRIMARY KEY CLUSTERED (date ASC, v ASC)
)

INSERT INTO test (date, v) VALUES ('2009-08-22 14:06:00.000', 1)
INSERT INTO test (date, v) VALUES ('2009-08-22 17:09:00.000', 8)
INSERT INTO test (date, v) VALUES ('2009-08-24 00:00:00.000', 2)
INSERT INTO test (date, v) VALUES ('2009-08-24 03:00:00.000', 9)
INSERT INTO test (date, v) VALUES ('2009-08-24 14:06:00.000', 7)

-- the lower bound is set to the table min for demo purposes; in reality
-- it could be any date
declare @min float
set @min = cast((select min(date) from test) as float)

-- similarly for max
declare @max float
set @max = cast((select max(date) from test) as float)

-- the number of results to return (assuming enough data is available)
declare @count int
set @count = 3

-- precompute scale factor
declare @scale float
set @scale = (@count - 1) / (@max - @min)
select @scale

-- this scales the dates from 0 to n-1
select (cast(date as float) - @min) * @scale, v from test

-- this rounds the scaled dates to the nearest partition,
-- groups by the partition, and then averages values in each partition
select round((cast(date as float) - @min) * @scale, 0), avg(v) from test
group by round((cast(date as float) - @min) * @scale, 0)
+1  A: 

Yes, SQL Server has always had some problems with this kind of time-partitioning summary SELECT. Analysis Services has a variety of ways to handle it, but the Data Services side is more limited.

What I would suggest you try (I cannot try or test anything from here) is to make a secondary "partition table" that contains your partition definitions and then join against it. You will need some matching indexes for this to have a chance to work:
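A minimal sketch of that idea, purely as an illustration (the partitions table, its columns, and the index names below are assumptions, not part of the original answer):

-- hypothetical bucket table: one row per output partition, with its date range
CREATE TABLE partitions
(
  bucket     INT      NOT NULL,
  start_date DATETIME NOT NULL,
  end_date   DATETIME NOT NULL,
  CONSTRAINT PK_partitions PRIMARY KEY CLUSTERED (bucket)
)

-- index supporting the range join below
CREATE INDEX IX_partitions_range ON partitions (start_date, end_date)

-- assign each sample to the bucket whose range contains it, then average per bucket
SELECT p.bucket, AVG(t.v) AS avg_v
FROM test t
JOIN partitions p ON t.date >= p.start_date AND t.date < p.end_date
GROUP BY p.bucket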

RBarryYoung
+2  A: 

There is really no way SQL Server would know that the date clustered key can be used to guarantee the order of an expression like round(cast(... as float), 0). The cast alone would throw it off the track; add in the (... - @min) * @scale and you've got yourself a perfect mess. If you need to sort and group by such expressions, store them in persisted computed columns and index them. You probably want to use DATEPART though, as going through an imprecise type like float is likely to render the expression unusable for a persisted computed column.
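A rough sketch of the persisted computed column approach, using a fixed hourly bucket rather than the question's dynamic N-way split (the column and index names here are assumptions for illustration):

-- deterministic integer bucket expression (hours since 1900-01-01),
-- so it can be persisted and indexed
ALTER TABLE test ADD bucket AS DATEDIFF(hour, 0, date) PERSISTED

-- index the bucket, covering v so the GROUP BY can be answered from the index
CREATE INDEX IX_test_bucket ON test (bucket) INCLUDE (v)

SELECT bucket, AVG(v) AS avg_v
FROM test
GROUP BY bucket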

Update

On the topic of date and float being equivalent:

declare @f float, @d datetime;
select @d = cast(1 as datetime);
select @f = cast(1 as float);
select cast(@d as varbinary(8)), cast(@f as varbinary(8)), @d, cast(@d as float)

Produces this:

0x0000000100000000  0x3FF0000000000000 1900-01-02 00:00:00.000 1

So you can see that although they are both stored in 8 bytes (at least for float(25..53)), the internal representation of datetime is not a float with the integer part being the day and the fractional part being the time (as is often assumed).

To give another example:

declare @d datetime;
select @d = '1900-01-02 12:00 PM';
select cast(@d as varbinary(8)), cast(@d as float)

0x0000000100C5C100  1.5

Again the result of casting @d to float is 1.5, but the datetime internal representation of 0x0000000100C5C100 would be the IEEE double value 2.1284E-314, not 1.5.

Remus Rusanu
In this example it should be quite easy to analyze at least the (... - @min) * @scale part. Unfortunately storing the "date" column as a float doesn't seem to make a difference. Ultimately though, you are right: it's a bit optimistic to expect SQL Server to solve this automatically. What I'm really hoping for is a way to tell it to assume that the data is already sorted. :) With respect to FLOAT being imprecise, I thought that DATETIME is just a FLOAT internally?
Ben Challenor
See my update on the date and float 'internal' assumption.
Remus Rusanu
Ah, that's very interesting! Thanks.
Ben Challenor
A: 

Two questions:

How long does this query take?

And are you sure that it is sorting on the date? Also, where in the plan is it sorting? After it partitions? That would be my guess; I doubt it's the first thing it does. Maybe the way it partitions or groups means it needs to sort again (see the snippet below for one way to check).

Anyway, even if it did sort an already-sorted list, you wouldn't think that would take very long, precisely because it is already sorted...
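One way to see where the sort sits is to look at the plan as text (a standard SQL Server option; the batch below just re-runs the query from the question):

-- show the estimated plan as text so the Sort operator and its position are visible
SET SHOWPLAN_TEXT ON
GO

declare @min float, @scale float
set @min = cast((select min(date) from test) as float)
set @scale = (3 - 1) / (cast((select max(date) from test) as float) - @min)

select round((cast(date as float) - @min) * @scale, 0), avg(v) from test
group by round((cast(date as float) - @min) * @scale, 0)
GO

SET SHOWPLAN_TEXT OFF
GO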

kralco626