views:

279

answers:

6

We have a number of machines which record data into a database at sporadic intervals. For each record, I'd like to obtain the time period between this recording and the previous recording.

I can do this using ROW_NUMBER as follows:

WITH TempTable AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
    FROM dbo.DataTable
)

SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM TempTable AS [Current]
INNER JOIN TempTable AS Previous 
    ON [Current].Machine_ID = Previous.Machine_ID
    AND Previous.Ordering = [Current].Ordering + 1

The problem is, it goes really slow (several minutes on a table with about 10k entries) - I tried creating separate indicies on Machine_ID and Date_Time, and a single joined-index, but nothing helps.

Is there anyway to rewrite this query to go faster?

+4  A: 

The given ROW_NUMBER() partition and order require an index on (Machine_ID, Date_Time) to satisfy in one pass:

CREATE INDEX idxMachineIDDateTime ON DataTable (Machine_ID, Date_Time);

Separate indexes on Machine_ID and Date_Time will help little, if any.

Remus Rusanu
As I said, I created that index as well, and it did not improve the query performance at all.
BlueRaja - Danny Pflughoeft
That is because your * triggers the index tipping point. Restrict it to only the columns needed and use include the make the non-clustered index covering. If too many columns are needed, then it has to be changed to a clustered index, with all the consequences.
Remus Rusanu
@Remus: You seem to be correct, removing the * decreases the time of the query to just a few seconds. I can't envision why this would happen - could you provide any links as to what an *index tipping point* is?
BlueRaja - Danny Pflughoeft
Well now you know one of the reasons why select * should not be in a production query. Plus you have joins so you were returning unnecessary columns by definition.
HLGEM
http://sqlskills.com/BLOGS/KIMBERLY/category/The-Tipping-Point.aspx
Remus Rusanu
A: 

What if you use a trigger to store the last timestamp an subtract each time to get the difference?

belisarius
Unfortunately, it is historical data, and not always added in order.
BlueRaja - Danny Pflughoeft
+2  A: 

I have had some strange performance problems using CTEs in SQL Server 2005. In many cases, replacing the CTE with a real temp table solved the problem.

I would try this before going any further with using a CTE.

I never found any explanation for the performance problems I've seen, and really didn't have any time to dig into the root causes. However I always suspected that the engine couldn't optimize the CTE in the same way that it can optimize a temp table (which can be indexed if more optimization is needed).

Update

After your comment that this is a view, I would first test the query with a temp table to see if that performs better.

If it does, and using a stored proc is not an option, you might consider making the current CTE into an indexed/materialized view. You will want to read up on the subject before going down this road, as whether this is a good idea depends on a lot of factors, not the least of which is how often the data is updated.

Phil Sandler
How would I do that? Would I need to replace the view with an Sproc (since views can't have variables)?
BlueRaja - Danny Pflughoeft
Yes, I wasn't clear it was a view from your question. See the update to my answer (will follow in a few minutes).
Phil Sandler
A: 

If you require this data often, rather than calculate it each time you pull the data, why not add a column and calculate/populate it whenever row is added?

(Remus' compound index will make the query fast; running it only once should make it faster still.)

Philip Kelley
+3  A: 

If the number of rows in dbo.DataTable is large then it is likely that you are experiencing the issue due to the CTE self joining onto itself. There is a blog post explaining the issue in some detail here

Occasionally in such cases I have resorted to creating a temporary table to insert the result of the CTE query into and then doing the joins against that temporary table (although this has usually been for cases where a large number of joins against the temp table are required - in the case of a single join the performance difference will be less noticable)

Paul McLoughlin
I second this approach. CTEs are simply inline rewrites. Just like repeating your own code and self-joining, there is nothing to guarantee that the optimizer will spool it into a temporary table. If you put things in your own table, you can choose indexes and/or avoid double work. Having said that, I do use CTEs where code maintenance is important and where the schema is liable to change very quickly (or in views, like this case).
Cade Roux
+3  A: 

How does it compare to this version?:

SELECT x.*
    ,(SELECT MAX(Date_Time)
      FROM dbo.DataTable
      WHERE Machine_ID = x.Machine_ID
          AND Date_Time < x.Date_Time
    ) AS PreviousDateTime
FROM dbo.DataTable AS x

Or this version?:

SELECT x.*
    ,triang_join.PreviousDateTime
FROM dbo.DataTable AS x
INNER JOIN (
    SELECT l.Machine_ID, l.Date_Time, MAX(r.Date_Time) AS PreviousDateTime
    FROM dbo.DataTable AS l
    LEFT JOIN dbo.DataTable AS r
    ON l.Machine_ID = r.Machine_ID
        AND l.Date_Time > r.Date_Time
    GROUP BY l.Machine_ID, l.Date_Time
) AS triang_join
ON triang_join.Machine_ID = x.Machine_ID
    AND triang_join.Date_Time = x.Date_Time

Both would perform best with an index on Machine_ID, Date_Time and for correct results, I'm assuming that this is unique.

You haven't mentioned what is hidden away in * and that can sometimes means a lot since a Machine_ID, Date_Time index will not generally be covering and if you have a lot of columns there or they have a lot of data, ...

Cade Roux
The second query finishes in seconds rather than minutes, but the first query executes faster than I can time it. Perfect - thanks!
BlueRaja - Danny Pflughoeft