
I need to write a query that will group a large number of records by periods of time from Year to Hour.

My initial approach has been to decide the periods procedurally in C#, iterate through each and run the SQL to get the data for that period, building up the dataset as I go.

SELECT Sum(someValues)
FROM table1
WHERE deliveryDate BETWEEN @fromDate AND @toDate

I've subsequently discovered I can group the records using Year(), Month(), and Day(), or datepart(week, date) and datepart(hh, date).

SELECT Sum(someValues)
FROM table1
GROUP BY Year(deliveryDate), Month(deliveryDate), Day(deliveryDate)

My concern is that using datepart in a GROUP BY will perform worse than running the query multiple times over set periods, because the index on the datetime field can't be used as efficiently. Any thoughts on whether this is true?

Thanks.

+2  A: 

As with anything performance related: measure.

Checking the query plan for the second approach will flag any obvious problems in advance (e.g. a full table scan when you know one is not needed), but there is no substitute for measuring. In SQL performance testing, that measurement should be done with appropriately sized test data.

Since this is a complex case, you are not simply comparing two different ways to do a single query but comparing a single-query approach against an iterative one, so aspects of your environment may play a major role in the actual performance.

Specifically

  1. The 'distance' between your application and the database, since the latency of each call is wasted time compared to the one-big-query approach
  2. Whether you are using prepared statements (if not, each query causes additional parsing effort for the database engine)
  3. Whether the construction of the range queries itself is costly (heavily influenced by 2)
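To make the comparison concrete, here is a minimal sketch of the two approaches using an in-memory SQLite database from Python. The table and column names (table1, deliveryDate, someValues) come from the question; the data, periods, and use of strftime instead of SQL Server's DATEPART are my own stand-ins:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (deliveryDate TEXT, someValues REAL)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("2009-01-15 08:00", 10.0), ("2009-01-20 09:30", 5.0),
     ("2009-02-01 10:00", 7.5), ("2009-02-28 23:00", 2.5)],
)

# Approach 1: one round trip per period, ranges decided in client code.
periods = [("2009-01-01", "2009-02-01"), ("2009-02-01", "2009-03-01")]
iterative = [
    conn.execute(
        "SELECT SUM(someValues) FROM table1 "
        "WHERE deliveryDate >= ? AND deliveryDate < ?",
        (lo, hi),
    ).fetchone()[0]
    for lo, hi in periods
]

# Approach 2: a single query, letting the database do the grouping.
grouped = [
    row[1]
    for row in conn.execute(
        "SELECT strftime('%Y-%m', deliveryDate) AS ym, SUM(someValues) "
        "FROM table1 GROUP BY ym ORDER BY ym"
    )
]

# Both produce the same per-month totals; the second needs one round trip.
assert iterative == grouped
```

The functional result is identical; the difference to measure is the cost of N round trips plus N query constructions versus one larger result set.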
ShuggyCoUk
A: 

I think you should benchmark it to get reliable results, but IMHO my first thought is that letting the DB take care of it (your second approach) will be much faster than doing it in your client code. With your first approach you have multiple round trips to the DB, which I think will be far more expensive. :)

Frederik Gheysels
+3  A: 

If you put a formula into the field part of a comparison, you get a table scan.

The index is on the field, not on datepart(field), so the value has to be calculated for ALL rows. I think your hunch is right.

Galwegian
There's no WHERE clause, so you're going to get a table scan anyway since it's going to look at every row.
Joe
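The point about formulas on the indexed field can be seen directly in a query plan. A small sketch using SQLite's EXPLAIN QUERY PLAN (the same principle applies to SQL Server's plans; strftime stands in for DATEPART, and the index name is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (deliveryDate TEXT, someValues REAL)")
conn.execute("CREATE INDEX ix_delivery ON table1 (deliveryDate)")

def plan(sql):
    # Concatenate the 'detail' column of the plan rows.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Sargable: a bare column comparison, so the index can be used.
p1 = plan("SELECT SUM(someValues) FROM table1 "
          "WHERE deliveryDate BETWEEN '2009-01-01' AND '2009-02-01'")

# Non-sargable: wrapping the column in a function hides it from the index.
p2 = plan("SELECT SUM(someValues) FROM table1 "
          "WHERE strftime('%Y', deliveryDate) = '2009'")

print(p1)  # index range search
print(p2)  # full table scan
```

The first plan searches via the index; the second has to scan and evaluate the function for every row, which is exactly the concern in the question.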
+1  A: 

you could do something similar to this:

SELECT Sum(someValues)
FROM 
(
    SELECT someValues, Year(deliveryDate) as Y, Month(deliveryDate) as M, Day(deliveryDate) as D
    FROM table1
    WHERE deliveryDate BETWEEN @fromDate AND @toDate
) t
GROUP BY Y, M, D
Mladen Prajdic
+2  A: 

If you can tolerate the performance hit of joining in yet one more table, I have a suggestion that seems odd but works really well.

Create a table, which I'll call ALMANAC, with columns like weekday, month, and year. You can even add columns for company-specific features of a date, such as whether it is a company holiday. You might also want to add starting and ending timestamp columns, as referenced below.

Although you might get by with one row per day, when I did this I found it convenient to use one row per shift, with three shifts in a day. Even at that rate, a period of ten years was only a little over 10,000 rows.

When you write the SQL to populate this table, you can use all the date-oriented built-in functions to make the job easier. When you query, you can use the date column as a join condition, or you may need two timestamps to provide a range for catching timestamps that fall within it. The rest is as easy as working with any other kind of data.
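A minimal sketch of the idea, using Python and SQLite with one row per day rather than per shift. The schema and names (almanac, is_holiday) are illustrative, not from a real system, and the join assumes date-only deliveryDate values; with full timestamps you would join on a start/end range instead:

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE almanac (
    day TEXT PRIMARY KEY,
    year INTEGER, month INTEGER, weekday INTEGER,
    is_holiday INTEGER)""")

# Populate one row per day; date parts are precomputed once, here.
d, end = date(2009, 1, 1), date(2009, 3, 1)
while d < end:
    conn.execute("INSERT INTO almanac VALUES (?,?,?,?,0)",
                 (d.isoformat(), d.year, d.month, d.isoweekday()))
    d += timedelta(days=1)

conn.execute("CREATE TABLE table1 (deliveryDate TEXT, someValues REAL)")
conn.executemany("INSERT INTO table1 VALUES (?,?)",
                 [("2009-01-15", 10.0), ("2009-02-01", 7.5)])

# Grouping becomes a plain join; no date functions in the report query.
rows = conn.execute("""
    SELECT a.year, a.month, SUM(t.someValues)
    FROM table1 t JOIN almanac a ON a.day = t.deliveryDate
    GROUP BY a.year, a.month ORDER BY a.month""").fetchall()
```

Because the date parts live in indexed columns of the almanac table, the report query never applies a function to the fact table's datetime field.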

Walter Mitty
A: 

You may want to look at a dimensional approach (this is similar to what Walter Mitty has suggested), where each row has a foreign key to a date and/or time dimension. This allows very flexible summation through a join to these tables, where the date parts are precalculated. In these cases, the key is usually a natural integer key of the form YYYYMMDD or HHMMSS, which is relatively performant and also human-readable.
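To illustrate the natural keys mentioned above, here is a small sketch of how a fact row's timestamp would be split into the two integer keys. The function names are my own; the key formats are as described:

```python
from datetime import datetime

def date_key(dt: datetime) -> int:
    # YYYYMMDD integer key into the date dimension.
    return dt.year * 10000 + dt.month * 100 + dt.day

def time_key(dt: datetime) -> int:
    # HHMMSS integer key into the time dimension.
    return dt.hour * 10000 + dt.minute * 100 + dt.second

dt = datetime(2009, 3, 14, 15, 9, 26)
assert date_key(dt) == 20090314
assert time_key(dt) == 150926
```

The fact table stores these two integers, and any summation by year, month, week, hour, etc. is a join to the dimension table plus a GROUP BY on its precalculated columns.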

Another alternative might be indexed views, where there are separate expressions for each of the date parts.

Or calculated columns.

But performance has to be tested and execution plans examined...

Cade Roux
A: 

I was looking for a similar solution for reporting purposes and came across an article called Group by Month (and other time periods). It shows various ways, good and bad, to group by a datetime field. Definitely worth a look.

alextansc