I have two tables and I'm looking for the rows in one table where a time column is not near any of the values in another table's time column. (Near is defined as within a minute).

Here's a code sample:

create table temp1
(
    id int identity primary key,
    value datetime not null 
)
GO

create index ix_temp1 on temp1(value, id);
GO

set nocount on
insert temp1 (value) values (DATEADD(second, rand() * 1000000, '20100101'))
GO 15000

Table temp2 is set up identically:

create table temp2
(
    id int identity primary key,
    value datetime not null 
)
GO

create index ix_temp2 on temp2(value, id);
GO

set nocount on
insert temp2 (value) values (DATEADD(second, rand() * 1000000, '20100101'))
GO 15000

And here's my first crack at it (which is very inefficient):

SELECT t1.id, t1.value
FROM temp1 t1
LEFT JOIN temp2 t2
    ON t1.value between DATEADD(MINUTE, -1, t2.value) and DATEADD(MINUTE, 1, t2.value)
WHERE t2.value is null

I'm looking for ways to do this more efficiently. All solutions will be considered (new indexes, SSIS solution, CLR solutions, temp tables, cursors etc...)

A: 

Take off both indexes and try again

kevchadders
As expected, no improvement. Why did you think that might help?
Michael J Swart
Due to the fact that you're only using two fields per table, and one of them is a primary key, I doubt those indexes will help in any way with speed, so they're just taking up unnecessary space.
kevchadders
A: 

My first suggestion is to give this to one of the developers and have them code up an algorithm in C or C#.

Otherwise, here is an idea: take the original data in the table and create new rows for plus and minus one minute. That might be a lot of data if you're working at second granularity. Then compare it with the data from the second table as you were doing.
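A minimal sketch of that expansion idea, assuming the comparison is coarsened to whole minutes (the CTE name and the minute-truncation expression are my own, not from the thread):

```sql
-- Sketch only: this coarsens "within a minute" to whole-minute buckets,
-- so it is approximate, not the exact 60-second distance the question asks for.
WITH expanded AS
(
    -- each temp2 value contributes its own minute bucket
    -- plus the buckets one minute either side
    SELECT DATEADD(MINUTE, o.m + DATEDIFF(MINUTE, 0, t2.value), 0) AS minute_bucket
    FROM temp2 t2
    CROSS JOIN (VALUES (-1), (0), (1)) AS o(m)
)
SELECT t1.id, t1.value
FROM temp1 t1
WHERE DATEADD(MINUTE, DATEDIFF(MINUTE, 0, t1.value), 0)
      NOT IN (SELECT minute_bucket FROM expanded);
```

As the answer notes, expanding at second granularity instead would make the intermediate set roughly 120 times larger.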

Alen
+4  A: 

The LEFT JOIN/IS NULL pattern isn't as efficient on SQL Server as NOT IN or NOT EXISTS when the columns are not nullable - see this link for details.

That said, this:

SELECT t1.id,
       t1.value
  FROM temp1 t1
 WHERE NOT EXISTS(SELECT NULL
                    FROM temp2 t2
                   WHERE t2.value BETWEEN DATEADD(MINUTE, -1, t1.value)  
                                      AND DATEADD(MINUTE, 1, t1.value))

...still has a problem in that using a function (i.e. DATEADD) renders the index useless. You're altering the column's data (temporarily, without writing it back to the table) while the index is on the original value.

I'm at a loss for options if you want the precision. Otherwise, if you alter the datetime before it's inserted into the temp table then you gain:

  1. ability to straight compare: t1.value = t2.value
  2. ability to use the index, assuming optimizer believes it can be of use
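A sketch of that pre-rounding approach, assuming persisted computed columns (the `rounded_value` column and the index name are hypothetical, not from the thread):

```sql
-- Hypothetical: store each value truncated to the minute so the
-- straight compare from point 1 above becomes possible.
ALTER TABLE temp1 ADD rounded_value AS
    DATEADD(MINUTE, DATEDIFF(MINUTE, 0, value), 0) PERSISTED;
ALTER TABLE temp2 ADD rounded_value AS
    DATEADD(MINUTE, DATEDIFF(MINUTE, 0, value), 0) PERSISTED;
CREATE INDEX ix_temp2_rounded ON temp2 (rounded_value);

-- Straight equality compare on the rounded column can use the index:
SELECT t1.id, t1.value
FROM temp1 t1
WHERE NOT EXISTS (SELECT 1
                  FROM temp2 t2
                  WHERE t2.rounded_value = t1.rounded_value);
```

This trades the exact 60-second test for minute-bucket equality, which is the precision loss discussed above.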
OMG Ponies
unfortunately the precision is needed.
Michael J Swart
@Michael J Swart: I figured as much, but thought I'd put it out there.
OMG Ponies
But I may be able to weed out many by bucketing the times to the minute. Times that are part of the same minute are within a minute of each other. This would give a smaller list to work with (which may help a bit).
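That weeding-out idea could be sketched as a two-stage filter (my own sketch, not code from the thread): a cheap bucket-equality test first, with the exact range check only on what survives.

```sql
-- Rows sharing a whole-minute bucket with any temp2 row are certainly
-- "near", so the bucket test can weed them out before the exact check.
SELECT t1.id, t1.value
FROM temp1 t1
WHERE NOT EXISTS   -- bucket prefilter
(
    SELECT 1
    FROM temp2 t2
    WHERE DATEDIFF(MINUTE, 0, t2.value) = DATEDIFF(MINUTE, 0, t1.value)
)
AND NOT EXISTS     -- exact 60-second check on the remaining rows
(
    SELECT 1
    FROM temp2 t2
    WHERE t2.value BETWEEN DATEADD(MINUTE, -1, t1.value)
                       AND DATEADD(MINUTE, 1, t1.value)
);
```

The first predicate is logically redundant (same bucket implies within a minute), so the result set is unchanged; whether it actually helps depends on the plan the optimizer chooses.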
Michael J Swart
+1 The Execution Plan for this is fine!
Martin Smith
Martin, you're absolutely right.
Michael J Swart
+2  A: 

This seems to do it pretty quick:

SELECT t.id,
       t.value
FROM 
(
   SELECT t1.id, 
          t1.value, 
          (SELECT MIN(temp2.value) FROM temp2 WHERE temp2.value >= t1.value) as theNext, 
          (SELECT MAX(temp2.value) FROM temp2 WHERE temp2.value <= t1.value) as thePrev
   FROM temp1 t1
) t 
WHERE DATEDIFF(second, t.value, t.theNext) > 60 
  AND DATEDIFF(second, t.thePrev, t.value) > 60

and it doesn't require any restructure of your tables.

Make sure to use seconds for the comparison, since minutes will get rounded. This runs in less than a second on my machine using your specifications for table creation.
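The rounding the answer warns about comes from DATEDIFF counting datepart boundary crossings rather than elapsed time, for example:

```sql
-- One second apart, yet a minute boundary is crossed:
SELECT DATEDIFF(MINUTE, '20100101 00:00:59', '20100101 00:01:00');  -- 1
-- Fifty-nine seconds apart, yet no boundary is crossed:
SELECT DATEDIFF(MINUTE, '20100101 00:00:00', '20100101 00:00:59');  -- 0
```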

EDIT: Added <= and >= to theNext and thePrev calculations. This prevents a false positive where temp1.value is equal to temp2.value.

md5sum
It doesn't return the same results: on my test data, 2388 vs. 6126 rows.
Martin Smith
This doesn't return the same result set as the original query when I tried it on my machine, using the OP's tables.
Joe Stefanelli
@Martin - if there's that big a difference, you probably caught it before I updated it to use seconds instead of minutes. Minutes get rounded and screw with the calculation. Also edited my answer to fix the smaller number of false matches.
md5sum
@md5sum - Yes that returns the same number of rows now and seems quite fast as well (though I think mine might have the edge!). (+1)
Martin Smith
@Martin - I dunno... :-P Both seem to run real close to ~300ms on my machine for 2400 rows returned... I'm not too picky past that point usually.
md5sum
I was basing my claim on the logical reads count with `set statistics io on` but I guess it's entirely possible for different data distributions the results may be different.
Martin Smith
You bet, this solution is also amazing. I'm a lucky question asker, I have two great answers to pick from. But I have to have some way to break ties and so logical reads is the tiebreaker for me too. I'm afraid Martin Smith's answer has the edge.
Michael J Swart
+2  A: 

Answer Rewritten

For your original query, changing the join condition from

LEFT JOIN temp2 t2
 ON t1.value BETWEEN DATEADD(MINUTE, -1, t2.value) AND DATEADD(MINUTE, 1, t2.value)

to

LEFT JOIN temp2 t2
 ON t2.value BETWEEN DATEADD(MINUTE, -1, t1.value) AND DATEADD(MINUTE, 1, t1.value)

makes a huge difference.

In both plans there is a scan on temp1 as the outer input to the nested loops iterator. However, for the first one the condition on temp2 is not sargable, so it needs to scan the whole of temp2 for each row in temp1. For the second version it can do a much more reasonable range seek on the index to retrieve the matching row(s).

However, the NOT EXISTS solution from @OMG Ponies' answer is more efficient in SQL Server.

Execution Plans:

(Ignore the "Cost Relative to the Batch" for the second one: the estimated row counts are way off the actuals, so that figure is misleading.)

[execution plan screenshots]

Martin Smith
@Martin - Nice solution! +1
md5sum
Very nice indeed!
Michael J Swart
@Michael - I had discounted the simple `NOT EXISTS` solution from @OMG Ponies answer due to the commentary that went with it but I've now tested it and found that is better than mine as it saves a scan on `temp1`!
Martin Smith
... Also your original query can be massively improved by changing the `JOIN` to `LEFT JOIN temp2 t2 ON t2.value BETWEEN DATEADD(MINUTE, -1, t1.value) AND DATEADD(MINUTE, 1, t1.value)`. Though it is still beaten by the others.
Martin Smith
A: 

I dealt with an issue similar to this by converting the DateTime value to the integer number of minutes since 1/1/2000--and writing that value into a column in my database table. So (in your case) the table would look like this:

create table temp2
(
    id int identity primary key,
    timeValue int not null
)

To compare with this table, just convert your comparison value to the integer number of minutes (I use a user-defined function for this) and compare.

DECLARE @newTime int;
SET @newTime = dbo.fnGetComparisonTime(@DateTimeValue)

Then retrieve your data:

SELECT id, timeValue 
FROM temp2
WHERE timeValue NOT BETWEEN (@newTime - 1) AND (@newTime + 1);

And the convert time to integer minutes function?

CREATE FUNCTION dbo.fnGetComparisonTime
    (
        @DateTimeValue datetime
    )
RETURNS int
AS
BEGIN
    -- Declarations
    DECLARE @Output int
    DECLARE @StartDate datetime

    SET @StartDate = '2000-01-01 00:00:00'
    SET @Output = DATEDIFF(minute, @StartDate, @DateTimeValue)

    -- And we're done!
    RETURN @Output

END

You can play around with the SELECT statement, of course, to get the results you want. Converting DateTime values to minutes is WAY faster than handling this with dates directly.

You might ask: is there a Y2K kind of problem with this? (You will, after all, run out of int minutes at 2^31 - 1.) Yes, in about 4,000 years. Be sure to document your code carefully...

John Murdoch
I'm personally really not fond of this type of solution. I understand that it will work perfectly well, but it requires a lot more understanding of the system for a new developer or contractor to begin working on it, and it reduces maintainability within the codebase. There are plenty of better, more understandable, and more easily maintained ways to perform this same task. On top of all that, you still have to build a query of the same complexity with this new methodology in order to gather the same data; you're simply changing the comparison type.
md5sum
A: 

This is really fast..

;WITH Time_CTE (ID, Table1_Time, Table2_Time) AS
(
    SELECT t1.id,
           t1.value AS Table1_Time,
           t2.value AS Table2_Time
    FROM temp1 t1
    INNER JOIN temp2 t2
        ON  YEAR(t1.value)  = YEAR(t2.value)
        AND MONTH(t1.value) = MONTH(t2.value)
        AND DAY(t1.value)   = DAY(t2.value)
)
SELECT TCTE.id, TCTE.Table1_Time
FROM Time_CTE TCTE
WHERE DATEDIFF(ss, Table1_Time, Table2_Time) < 61
   OR DATEDIFF(ss, Table2_Time, Table1_Time) < 61

Ryan Cooper
Think there must be an error in it (I cancelled execution after 32 seconds by which time it had output 870,000 rows)
Martin Smith
I'm sorry... this actually runs slower than the solution that the OP posted... and after 2:52 on my machine it finally threw a System.OutOfMemoryException... with 16777216 rows...
md5sum