views:

71

answers:

3

I have two SQL queries producing different results when I would expect them to produce the same result. I am trying to find the number of events that do not have a corresponding location. All locations have an event but events can also link to non-location records.

The following query produces a count of 16244, the correct value.

SELECT COUNT(DISTINCT e.event_id)   
FROM   events AS e   
WHERE  NOT EXISTS   
  (SELECT * FROM locations AS l WHERE l.event_id = e.event_id)    

The following query produces a count of 0.

SELECT COUNT(DISTINCT e.event_id) 
FROM   events AS e
WHERE  e.event_id NOT IN (SELECT  l.event_id FROM locations AS l)

The following SQL does some summaries of the data set

SELECT  'Event Count', 
        COUNT(DISTINCT event_id) 
        FROM events

UNION ALL

SELECT  'Locations Count', 
        COUNT(DISTINCT event_id) 
        FROM locations

UNION ALL

SELECT  'Event+Location Count', 
        COUNT(DISTINCT l.event_id) 
        FROM locations AS l  JOIN events AS e ON l.event_Id = e.event_id

And returns the following results

Event Count         139599
Locations Count         123355
Event+Location Count    123355

Can anyone shed any light on why the 2 initial queries do not produce the same figure.

+4  A: 

The NOT IN form works differently for NULLs. The presence of a single NULL will cause the entire statement to fail, thus returning no results.

So you have at least one event_id in locations that is NULL.

Also, your query might be better written as a join:

SELECT 
    COUNT(DISTINCT e.event_id)    
FROM
    events AS e  
    LEFT JOIN locations AS l ON e.event_id = l.event_id
WHERE
    l.event_id IS NULL

[UPDATE: apparently, the NOT EXISTS version is faster.]

Mitch Wheat
+1 Thanks. Out of interest why would it be better written as a join? The Joined version is marginally slower for my dataset but both the exists and the join are treated in a very similar fashion according to the execution plan but the JOIN does a Merge Join (Left Outer) and the EXISTS does a Merge Join (Left Anti Semi)
Steve Weet
@Steve It wouldn't. See http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/
Martin Smith
Thanks Martin. That was an excellent article.
Steve Weet
Wow!, Thanks, Martin. I'll have to revise what I know, and do a spot of testing in SQL Server 2008 R2.
Mitch Wheat
Is WHERE NOT EXISTS (SELECT 1 FROM locations ...) better than using '*' , or the same plan?
Mitch Wheat
I'm not so sure about joining on the char(4) columns. What happens when using a 'usual' join on int columns?
Mitch Wheat
@Mitch - RE: `SELECT 1` That's something I'm not sure about. I believe this used to make a difference but now they will use exactly the same plan but I have noticed Phil Factor still uses it (e.g. http://www.simple-talk.com/community/blogs/philfactor/archive/2009/08/03/74227.aspx). I'm not sure if this is out of habit or for some other reason.
Martin Smith
+3  A: 

You have a NULL in the subquery SELECT l.event_id FROM locations AS l so NOT IN will always evaluate to unknown and return 0 results

SELECT COUNT(DISTINCT e.event_id) 
FROM   events AS e
WHERE  e.event_id NOT IN (SELECT  l.event_id FROM locations AS l)

The reason for this behaviour can be seen from the below example.

'x' NOT IN (NULL,'a','b')

≡ 'x' <> NULL and 'x' <> 'a' and 'x' <> 'b'

≡ Unknown and True and True

≡ Unknown

Martin Smith
A: 

In and Exists are processed very very differently.

Select * from T1 where x in ( select y from T2 )

is typically processed as:

select * 
  from t1, ( select distinct y from t2 ) t2
 where t1.x = t2.y;

The subquery is evaluated, distinct'ed, indexed (or hashed or sorted) and then joined to the original table -- typically.

As opposed to

select * from t1 where exists ( select null from t2 where y = x )

That is processed more like:

for x in ( select * from t1 )
   loop
      if ( exists ( select null from t2 where y = x.x )
      then 
         OUTPUT THE RECORD
      end if
   end loop

It always results in a full scan of T1 whereas the first query can make use of an index on T1(x).

Bharat
In my particular case the query optimizer uses exactly the same query plan for both queries.
Steve Weet