views:

2274

answers:

8

I am wondering if there is a good-performing query to select distinct dates (ignoring times) from a table with a datetime field in SQL Server.

My problem isn't getting the server to actually do this (I've seen this question already, and we had something similar in place using DISTINCT). The problem is whether there is any trick to getting it done more quickly. With the data we are using, our current query returns ~80 distinct days from ~40,000 rows of data (after filtering on another indexed column), there is an index on the date column, and the query still takes 5+ seconds, which is too slow.

Changing the database structure might be an option, but a less desirable one.

+4  A: 

I've used the following construct:

CAST(FLOOR(CAST(@date as FLOAT)) as DateTime);

This removes the time from the date by converting it to a float and truncating the fractional part, which is where the time is stored.

It looks a little clunky, but it works well on a large dataset (~100,000 rows) that I use repeatedly throughout the day.
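As a sketch of how this might be used for the original question (the table and column names here are hypothetical):

```sql
-- Hypothetical table/column names. The inner CAST turns the datetime into a
-- float whose fractional part holds the time of day; FLOOR drops it.
SELECT DISTINCT CAST(FLOOR(CAST(created_at AS FLOAT)) AS DATETIME) AS day_only
FROM dbo.orders
ORDER BY day_only;
```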

Scott Vercuski
+3  A: 

The simplest way is to add a computed column for just the date portion, and select on that. You could do this in a view if you don't want to change the table.
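A minimal sketch of both variants, assuming a hypothetical table `dbo.foo` with a `datetime` column `dt`:

```sql
-- Computed column on the table itself (names are hypothetical):
ALTER TABLE dbo.foo ADD date_only AS DATEADD(day, DATEDIFF(day, 0, dt), 0);

-- ...or, if you'd rather leave the table alone, the same expression in a view:
CREATE VIEW dbo.v_foo AS
SELECT dt, DATEADD(day, DATEDIFF(day, 0, dt), 0) AS date_only
FROM dbo.foo;
```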

Joel Coehoorn
A: 

Update:

The solution below was tested for efficiency on a 2M-row table and takes only 40 ms.

Plain DISTINCT on an indexed computed column took 9 seconds.

See this entry in my blog for performance details:


Unfortunately, SQL Server's optimizer can do neither Oracle's SKIP SCAN nor MySQL's INDEX FOR GROUP-BY.

It's always Stream Aggregate that takes long.

You can build a list of possible dates using a recursive CTE and join it with your table:

WITH    rows AS (
        SELECT  CAST(CAST(CAST(MIN(date) AS FLOAT) AS INTEGER) AS DATETIME) AS mindate, MAX(date) AS maxdate
        FROM    mytable
        UNION ALL
        SELECT  mindate + 1, maxdate
        FROM    rows
        WHERE   mindate < maxdate
        )
SELECT  mindate
FROM    rows
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    mytable
        WHERE   date >= mindate
                AND date < mindate + 1
        )
OPTION  (MAXRECURSION 0)

This will be more efficient than a Stream Aggregate.

Quassnoi
A: 

If you want to avoid extracting or reformatting the date at query time (presumably the main cause of the delay, since it forces a full table scan), you have no alternative but to store the date-only part of the datetime, which unfortunately requires an alteration to the database structure.

If you're using SQL Server 2005 or later, then a persisted computed column is the way to go:

Unless otherwise specified, computed columns are virtual columns that are
not physically stored in the table. Their values are recalculated every 
time they are referenced in a query. The Database Engine uses the PERSISTED 
keyword in the CREATE TABLE and ALTER TABLE statements to physically store 
computed columns in the table. Their values are updated when any columns 
that are part of their calculation change. By marking a computed column as 
PERSISTED, you can create an index on a computed column that is deterministic
but not precise. 
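A sketch of what that could look like, with hypothetical table and column names:

```sql
-- PERSISTED stores the computed value physically, which lets it be indexed.
ALTER TABLE dbo.foo
    ADD date_only AS DATEADD(day, DATEDIFF(day, 0, dt), 0) PERSISTED;

CREATE INDEX ix_foo_date_only ON dbo.foo (date_only);
```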
Cruachan
The main cause of the delay is the scan and sort needed to produce the distinct values. Unless something *extremely* complex occurs in a scalar operation, the delays in a database are always related to data access, not to scalar operations.
Remus Rusanu
It's the main cause of the delay because it forces a full table scan - sorry, should have made that clear
Cruachan
A: 

What is your predicate on that other filtered column? Have you tried whether you get an improvement from an index on that other column, followed by the datetime field?

I'm largely guessing here, but 5 seconds to filter a set of perhaps 100,000 rows down to 40,000 and then sort it (which is presumably what goes on) doesn't seem like an unreasonable time to me. Why do you say it's too slow? Because it doesn't match expectations?
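If the filter is the expensive part, a composite index leading with the filtered column might help; a sketch with hypothetical names, where `filter_col` stands for the indexed column the question filters on:

```sql
-- Lets the filter be satisfied by an index seek, with the dates already
-- available in index order for the subsequent DISTINCT.
CREATE INDEX ix_foo_filter_date ON dbo.foo (filter_col, dt);
```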

+1  A: 

Every option that involves CAST or TRUNCATE or DATEPART manipulation on the datetime field has the same problem: the query has to scan the entire resultset (the 40k rows) in order to find the distinct dates. Performance may vary marginally between the various implementations.

What you really need is an index that can produce the response in a blink. You can either have a persisted computed column with an index on it (requires table structure changes) or an indexed view (requires Enterprise Edition for the query optimizer to consider the index out of the box).

Persisted computed column:

alter table foo add date_only as convert(char(8), [datetimecolumn], 112) persisted;
create index idx_foo_date_only on foo(date_only);

Indexed view:

create view v_foo_with_date_only
with schemabinding as
select id
    , convert(char(8), [datetimecolumn], 112) as date_only
from dbo.foo;
go

create unique clustered index idx_v_foo on v_foo_with_date_only(date_only, id);

Update

To eliminate the scan completely, one could use a GROUP BY trick in an indexed view, like this:

create view v_foo_with_date_only
with schemabinding as
select
    convert(char(8), [d], 112) as date_only
    , count_big(*) as [dummy]
from dbo.foo
group by convert(char(8), [d], 112);
go

create unique clustered index idx_v_foo on v_foo_with_date_only(date_only);

The query `select distinct date_only from foo` will use this indexed view instead. Technically it is still a scan, but over an already-'distinct' index, so only the needed records are scanned. It's a hack, I reckon, and I would not recommend it for live production code.

AFAIK SQL Server does not have the capability of scanning an index while skipping repeats, i.e. seek the top value, then seek the next value greater than it, and so on, successively seeking greater than the last value found.
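That seek-then-seek-greater pattern can be emulated by hand with a recursive CTE, though; a sketch assuming an indexed date-only column named `date_only` on a hypothetical table `dbo.foo`:

```sql
-- Emulated skip scan: the anchor seeks the smallest date, and each recursive
-- step does a single index seek for the smallest date greater than the last.
WITH skip_scan AS (
    SELECT MIN(date_only) AS d FROM dbo.foo
    UNION ALL
    SELECT (SELECT MIN(f.date_only) FROM dbo.foo AS f WHERE f.date_only > s.d)
    FROM skip_scan AS s
    WHERE s.d IS NOT NULL
)
SELECT d
FROM skip_scan
WHERE d IS NOT NULL
OPTION (MAXRECURSION 0);
```

Each step costs one seek, so the whole query does roughly one seek per distinct day rather than touching all 40,000 rows.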

Remus Rusanu
Is there any way to use `SKIP SCAN` in `SQL Server`? I just tried your solution on a `2M` table and it got even worse (`DISTINCT CAST(...)` on a `DATETIME` field took `850 ms` with a `Hash Match Aggregate`, `DISTINCT date` took `1800 ms` with a `Stream Aggregate` ). `Oracle` and `MySQL` would just jump over the distinct fields in index, `SQL Server` doesn't do it.
Quassnoi
You need to select distinct date_only after an index was created on it.
Remus Rusanu
`@Remus`: I did create an index, and the optimizer did use it.
Quassnoi
In my tests with 2M records it was much faster: a scan and an aggregate. SKIP SCAN is something else, though: if you have index idx1 on (ColA, ColB) and a predicate on ColB, a skip scan will use idx1 despite the fact that ColA has no predicate. True, though, there is no SKIP SCAN in SQL Server.
Remus Rusanu
+1  A: 

I'm not sure why your existing query would take over 5s for 40,000 rows.

I just tried the following query against a table with 100,000 rows and it returned in less than 0.1s.

SELECT DISTINCT DATEADD(day, DATEDIFF(day, 0, your_date_column), 0)
FROM your_table

(Note that this query probably won't be able to take advantage of any indexes on the date column, but it should be reasonably quick, assuming that you're not executing it dozens of times per second.)

LukeH
A: 

Just convert the date: dateadd(dd, datediff(dd, 0, [Some_Column]), 0)

Jeff O