views:

371

answers:

9

I have a table of events, I need to find all tail events of type 1 and all head events of type 1.

So, for the set of events in this order [1, 1], 3, 1 ,4, 5, [1,1,1] the brackets denote head and tail events of type 1.

This is much better illustrated in SQL:

drop table #event
go 
create table #event (group_id int, [date] datetime, [type] int)
create index idx1 on #event (group_id, date)


insert into #event values (1, '2000-01-01', 1) 
insert into #event values (1, '2000-01-02', 1) 
insert into #event values (1, '2000-01-03', 3) 
insert into #event values (1, '2000-01-04', 2) 
insert into #event values (1, '2000-01-05', 1) 
insert into #event values (2, '2000-01-01', 2) 
insert into #event values (2, '2000-01-02', 2) 
insert into #event values (2, '2000-01-03', 3) 
insert into #event values (2, '2000-01-04', 2) 
insert into #event values (2, '2000-01-05', 1) 
insert into #event values (3, '2000-01-01', 1) 
insert into #event values (3, '2000-01-02', 2) 
insert into #event values (3, '2000-01-03', 1) 
insert into #event values (3, '2000-01-04', 2) 
insert into #event values (3, '2000-01-05', 2) 
insert into #event values (4, '2000-01-01', 2) 
insert into #event values (4, '2000-01-02', 2) 
insert into #event values (4, '2000-01-03', 3) 
insert into #event values (4, '2000-01-04', 1) 
insert into #event values (4, '2000-01-05', 1) 

go 

select e1.* from #event e1 
where (
    not exists (
        select top 1 1 
        from #event e2 
        where e1.group_id = e2.group_id 
        and e2.date < e1.date 
        and e2.type <> 1
    ) or not exists (
        select top 1 1 
        from #event e2 
        where e1.group_id = e2.group_id 
        and e2.date > e1.date 
        and e2.type <> 1
    ) 
)
and e1.type = 1

Expected results:

1   2000-01-01 00:00:00.000 1
1   2000-01-02 00:00:00.000 1
1   2000-01-05 00:00:00.000 1
2   2000-01-05 00:00:00.000 1
3   2000-01-01 00:00:00.000 1
4   2000-01-04 00:00:00.000 1
4   2000-01-05 00:00:00.000 1

This all works just fine and returns my expected results, but it scans through the table 3 times. Is there any way to make this perform faster and reduce the number of table scans?

A: 

From what I understand you are after is the head and tail, ordered by day**, for each ID**. The head and tail being all the records until record where the type is not one is encountered.

This is a different way of doing it, it may be faster

;WITH Ranked AS (
    SELECT 
     *,
     Row_Number() OVER (ORDER BY date) as 'rnk'
    FROM #event
)


SELECT * 
FROM Ranked
WHERE rnk not between 
     (SELECT Min(rnk) FROM Ranked r WHERE r.type <> 1 AND ranked.id = r.id)
     AND (SELECT Max(rnk) FROM Ranked r WHERE r.type <> 1 AND ranked.id = r.id)
order by id

You do not need to use TOP with the exists statement although it doesn't hurt.

Robert Wagner
This is taking twice the number of reads an 20% longer according to an sql trace, according to query execution plan its 30 times slower (but the number in the plan are often way too high)
Sam Saffron
A: 

Basically looks like you are looking to get event of type 1 that are lesser than in timestamp than other events and greater than timestamp event of other date Try this, its written in Oracle Syntax, not sure about MSSQL.

select e1.* from e1 where e1.id = 1 and (e1.date <= ( select min(e2.date) from e2 where e2.id <> 1 group by e2.date ) or (e1.date >= select max(e3.date) from e3 where e3.id <> 1 group by e3.date ) )

Dheer
I can not get this SQL to work in SQL Server ... even once i alias everything properly the sub-queries produce an error
Sam Saffron
A: 

Consider

create index idx2 on #event (type)

I don't have SQL Server to check but in Oracle it will eliminate the top-level scan (for 'type=1' condition).

Regarding the query itself - in MS SQL 2000 '[not] exists' and '[not] in' predicates almost always did full scan - we were replacing them with appropriate JOIN's.

Dmitry Khalatov
another index does not help out here, same performance, 3 table scans. I guess part of the question is "is there any way to replace this with a join"
Sam Saffron
A: 

With only 20 records the query optimizer may easily be concluding that 3 table scans is the quickest way to do it no matter how you express the query or what indexes you create. With small data records the entire table would probably load with one or a handful of disk reads; and within a read block there's not much optimization involved; the optimizer primarily is interested in minimizing disk activity, and all else is comparatively inconsequential. What the optimizer does with so few records is not a good indication of how it will handle larger volumes.

Do you have a real table with more data in it? I can't imagine you can perceive any exectuion time at all.

le dorfier
Fair point will amend the question to include a massive data set
Sam Saffron
Ok still getting exactly 3 table scans with Roberts Query it took too long to return a result
Sam Saffron
btw, I use both SQL server profiler and I look at the execution plan to determine performance
Sam Saffron
+1  A: 

To generate a large subset of data you can use this:

declare @i int 
set @i = 10000
while @i > 5 
begin
    insert into #event values (@i, '2000-01-01', 1) 
    insert into #event values (@i, '2000-01-02', 1) 
    insert into #event values (@i, '2000-01-03', 3) 
    insert into #event values (@i, '2000-01-04', 2) 
    insert into #event values (@i, '2000-01-05', 1)  
    set @i = @i -1 
end

Also, to include lots of events per group try this:

declare @j int 
set @j = 0 
while @j < 10
begin 
    set nocount on 
    declare @i int 
    set @i = 0
    while @i < 10000 
    begin
     insert into #event values (@j, DateAdd(d, @i, '2000-01-01'), rand(10) * 10) 

     set @i = @i  +1 
    end
    set @j = @j + 1 
end
set nocount off

In all my testing it seems my original query only produces 3 table scans and I am not really sure if performance can be improved here.

Sam Saffron
A: 

Here is a version with joins:

select distinct e1.* from #event e1 
left outer join #event e2 ON 
     e1.id = e2.id 
        and e2.date < e1.date 
        and e2.type <> 1
left outer join #event e3 ON
     e1.id = e3.id 
        and e3.date > e1.date 
        and e3.type <> 1
where e1.type = 1 AND (e2.id is null or e3.id is null)

This still has three table scans plus a distinct clause, but it still seems to be faster than the original query.

csgero
A: 

Here's a query whose showplan at least doesn't say "table scan" in it, and gives the same answer.

I don't understand the query in logical terms yet though.

SELECT DISTINCT e1.* FROM #event e1
WHERE e1.type = 1
AND
(
NOT EXISTS (
SELECT 1 FROM #event
WHERE type != 1
AND id = e1.id
AND date < e1.date
)
OR NOT EXISTS (
SELECT 1 FROM #event
WHERE type != 1
AND id = e1.id
AND date > e1.date
)
)

le dorfier
Nope, hold that - I tested after applying your data generator. But don't concede yet.
le dorfier
A: 

This is better than the last one:

SELECT e1.* FROM event e1
WHERE e1.type = 1
AND NOT EXISTS (
SELECT 1 FROM event
WHERE type != 1
AND id = e1.id
AND date < e1.date
)
UNION ALL SELECT e1.* FROM event e1
WHERE e1.type = 1
AND NOT EXISTS (
SELECT 1 FROM event
WHERE type != 1
AND id = e1.id
AND date > e1.date
)

It gets the same result. But your query, and my previous one, are awful. Here's the IO statistics:

(29985 row(s) affected)

Table 'event'. Scan count 9997, logical reads 20477, physical reads 0, read-ahead reads 0.
Table 'Worktable'. Scan count 39979, logical reads 99950, physical reads 0, read-ahead reads 0.

Here's the same for the most recent query:

(29985 row(s) affected)

Table 'event'. Scan count 4, logical reads 652, physical reads 0, read-ahead reads 0.

Two things to notice --
1. Even worst case, the whole table gets loaded and there are no physical reads.
2. The bad ones have 9997 + 39979 table scans. Not just 4.

Please describe the intent of the query.

le dorfier
+1  A: 

I think this is better:

select e1.group_id, e1.date, e1.type
from #event e1, #event e2
where e1.type = 1
and e2.type <> 1
and e1.group_id= e2.group_id
group by e1.group_id, e1.date, e1.type, e2.group_id
having e1.date < min(e2.date) or e1.date > max(e2.date)

adamant7
Though this does not achieve the ideal one table scan, it does reduce the table scan to 2 and heavily simplify the execution plan. +1
Sam Saffron
Originally I thought this is faster but it collapses if you have lots of events for the same group
Sam Saffron