views:

115

answers:

5

This question is largely driven by curiosity, as I do have a working query (it just takes a little longer than I would like).

I have a table with 4 million rows. The only index on this table is an auto-increment BigInt ID. The query is looking for distinct values in one of the columns, but only going back 1 day. Unfortunately, the ReportDate column that is evaluated is not of the DateTime type, or even a BigInt, but is char(8) in the format of YYYYMMDD. So the query is a bit slow.

  SELECT Category 
    FROM Reports 
   where ReportDate = CONVERT(VARCHAR(8), GETDATE(), 112) 
GROUP BY Category

Note that the date converstion in the above statement is simply converting it to a YYYYMMDD format for comparison.

I was wondering if there was a way to optimize this query based on the fact that I know that the only data I am interested in is at the "bottom" of the table. I was thinking of some sort of recursive SELECT function which gradually grew a temporary table that could be used for the final query.

For example, in psuedo-sql:

N = 128
TemporaryTable = SELECT TOP {N} * 
                   FROM Reports 
               ORDER BY ID DESC 

/* Once we hit a date < Today, we can stop */
if(TemporaryTable does not contain ReportDate < Today) 
  N = N**2
  Repeat Select

/* We now have a smallish table to do our query */
  SELECT Category 
    FROM TemproaryTable 
   where ReportDate = CONVERT(VARCHAR(8), GETDATE(), 112) 
GROUP BY Category

Does that make sense? Is something like that possible?

This is on MS SQL Server 2008.

+3  A: 

Why not just create the index you need?

create index idx_Reports_ReportDate 
    on Reports(ReportDate, Category)
Joe Stefanelli
I am not the data owner - working on updating the index, but looking for alternative solutions (not to mention curious about alternatives in general)
Matt
+4  A: 

I might suggest you do not need to convert the Date that is stored as char data in YYYYMMDD format; That format is inherently sortable all by itself. I would instead convert your date to output in that format.

Also, the way you have the conversion written, it is converting the current DateTime for every individual row, so even storing that value for the whole query could speed things up... but I think just converting the date you are searching for to that format of char would help.

I would also suggest getting the index(es) you need created, of course... but that's not the question you asked :P

Andrew Barber
+3  A: 

No, that doesn't make sense. The only way to optimize this query is to have a covering index for it:

CREATE INDEX ndxReportDateCategory ON Reports (ReportDate, Category);

Update

Considering your comment that you cannot modify the schema, then you should modify the schema. If you still can't, then the answer still applies: the solution is to have an index.

And finally, to answer more directly your question, if you have a strong correlation between ID and ReportData: the ID you seek is the biggest one that has a ReportDate smaller than the date you're after:

SELECT MAX(Id) 
FROM Reports
WHERE ReportDate < 'YYYYMMDD';

This will do a reverse scan on the ID index and stop at the first ID that is previous to your desired date (ie. will not scan the entire table). You can then filter your reports base don this found max Id.

Remus Rusanu
Intersting - is this a reverse scan because it is looking at Max(ID)?
Matt
Yes. MAX(Column) where there is an index on Column can be satisfied by a quick look at either end of the index. IF you have in addition a filter, then a scan operator can quickly find the MAX (or MIN) by starting from the appropriate end and then going one by one and checking the condition. First record that passes the filter is the desired value.
Remus Rusanu
Note that this assumes there is a deterministic relation between Id and ReportDate, ie. no dates are reversed in the Id order.
Remus Rusanu
That is correct, they are deterministic
Matt
I'd love to give you the green check - but your answer is contradictory "The only way to optimize the query is to have a covering index for it", and then you have an answer, which optimizes the query without creating an index (query went from 20+sec to sub zero). Oh, and I get the point - creating indexes == good, but your rhetoric is getting in the way of your answer.
Matt
Others will google and land on this page. I prefer the rhetoric to be there and drive home the point that the solution to performance problems in relational tables is almost always in the shape of your schema (ie. a covering index), and almost never in the text of your query.
Remus Rusanu
A: 

One of the query patterns I occasionally use to get into a log table with similiar indexing to yours is to limit by subquery:

DECLARE @ReportDate varchar(8)
SET @ReportDate = Convert(varchar(8), GetDate(), 112)

SELECT *
FROM
(
SELECT top 20000 *
FROM Reports
ORDER BY ID desc
) sub
WHERE sub.ReportDate = @ReportDate

20k/4M = 0.5% of the table is read.


Here's a loop solution. Note: might want to make ID primary key and Reportdate indexed in the temp table.

DECLARE @ReportDate varchar(8)
SET @ReportDate = Convert(varchar(8), GetDate(), 112)

DECLARE @CurrentDate varchar(8), MinKey bigint


SELECT top 2000 * INTO #MyTable
FROM Reports ORDER BY ID desc

SELECT @CurrentDate = MIN(ReportDate), @MinKey = MIN(ID)
FROM #MyTable

WHILE @ReportDate <= @CurrentDate
BEGIN

  SELECT top 2000 * INTO #MyTable
  FROM Reports WHERE ID < @MinKey ORDER BY ID desc

  SELECT @CurrentDate = MIN(ReportDate), @MinKey = MIN(ID)
  FROM #MyTable

END

SELECT * FROM #MyTable
WHERE ReportDate = @ReportDate


DROP TABLE #MyTable
David B
That is sort of what I had in mind - but without necessarily knowing what the size of the subquery is ahead of time.
Matt
+1  A: 

I think you will find the discussion on SARGability, on Rob Farley's Blog to be very interesting reading in relation to your post topic.

http://sqlblog.com/blogs/rob_farley/archive/2010/01/21/sargable-functions-in-sql-server.aspx

An interesting alternative approach that does not require you to modify the existing column data type would be to leverage computed columns.

alter table REPORTS
add castAsDate as CAST(ReportDate as date)

create index rf_so2 on REPORTS(castAsDate) include (ReportDate)
John Sansom
+1, good read, thanks
Matt
You're most welcome.
John Sansom