views:

96

answers:

2

Why does the comparison of value to null return false, except when using a NOT IN, where it returns true?


Given a query to find all stackoverflow users who have a post:

SELECT * FROM Users
WHERE UserID IN (SELECT UserID FROM Posts)

This works as expected; i get a list of all users who have a post.

Now query for the inverse; find all stackoverflow users who don't have a post:

SELECT * FROM Users
WHERE UserID NOT IN (SELECT UserID FROM Posts)

This returns no records, which is incorrect.

Given hypothetical data1

Users              Posts
================   ===============================
UserID  Username   PostID   UserID  Subject
------  --------   -------  ------  ----------------
1       atkins     1        1       Welcome to stack ov...
2       joels      2        2       Welcome all!
...     ...        ...      ...
399573  gt6989b    ...      ...
...     ...        ...      ...
                   10592    null    (deleted by nsl&fbi...
                   ...      ... 

And assume the rules of NULLs:

  • NULL = NULL evaluates to unknown
  • NULL <> NULL evaluates to unknown
  • value = NULL evaluates unknown

If we look at the 2nd query, we're interested in finding all rows where the Users.UserID is not found in the Posts.UserID column. i would proceed logically as follows:

Check UserID 1

  • 1 = 1 returns true. So we conclude that this user has some posts, and do not include them in the output list

Now check UserID 2:

  • 2 = 1 returns false, so we keep looking
  • 2 = 2 returns true, so we conclude that this user has some posts, and do not include them in the output list

Now check UserID 399573

  • 399573 = 1 returns false, so we keep looking
  • 399573 = 2 returns false, so we keep looking
  • ...
  • 399573 = null returns unknown, so we keep looking
  • ...

We found no posts by UserID 399573, so we would include him in the output list.

Except SQL Server doesn't do this. If you have a NULL in your in list, then suddenly it finds a match. It suddenly finds a match. Suddenly 399573 = null evaluates to true.

Why does the comparison of value to null return unknown, except when it returns true?

Edit: i know that i can workaround this nonsensical behavior by specifically excluding the nulls:

SELECT * FROM Users
WHERE UserID NOT IN (
   SELECT UserID FROM Posts
   WHERE UserID IS NOT NULL)

But i shouldn't have to, as far as i can tell the boolean logic should be fine without it - hence my question.

Footnotes

  • 1 hypothetical data; if you don't like it: make up your down.
  • celko now has his own tag
+1  A: 

Common problem, canned answer:

The behavior of NOT IN clause may be confusing and as such it needs some explanations. Consider the following query:

SELECT LastName, FirstName FROM Person.Contact WHERE LastName NOT IN('Hedlund', 'Holloway', NULL)

Although there are more than a thousand distinct last names in AdventureWorks.Person.Contact, the query returns nothing. This may look counterintuitive to a beginner database programmer, but it actually makes perfect sense. The explanation consist of several simple steps. First of all, consider the following two queries, which are clearly equivalent:

SELECT LastName, FirstName FROM Person.Contact

WHERE LastName IN('Hedlund', 'Holloway', NULL)



SELECT LastName, FirstName FROM Person.Contact

WHERE LastName='Hedlund' OR LastName='Holloway' OR LastName=NULL

Note that both queries return expected results. Now, let us recall DeMorgan's theorem, which states that:

not (P and Q) = (not P) or (not Q)

not (P or Q) = (not P) and (not Q)

I am cutting and pasting from Wikipedia (http://en.wikipedia.org/wiki/De_Morgan_duality). Applying DeMorgan's theorem to this queries, it follows that these two queries are also equivalent:

SELECT LastName, FirstName FROM Person.Contact WHERE LastName NOT IN('Hedlund', 'Holloway', NULL)



SELECT LastName, FirstName FROM Person.Contact

WHERE LastName<>'Hedlund' AND LastName<>'Holloway' AND LastName<>NULL

This last LastName<>NULL can never be true

AlexKuznetsov
DeMorgan's theorem applies to boolean logic, not to 3-valued logic. For example: `not (unknown or true) = false`, but `(not unknown) and (not true) = unknown`, violating the theorem
Andomar
My confusion comes from the second "equivalent" set of queries. i would argue that they're not equivalent - since they return different answers. But my argument is irrelevant, since the ANSI guys *defined* the first is an alias of the 2nd.
Ian Boyd
Accepted this answer, as it points out `NOT IN` is defined to be an alias for `(x<>a) AND (x<>b) AND (x<>c) ...`.
Ian Boyd
+2  A: 

The assumption in your first sentence isn't right:

Why does the comparison of value to null return false, except when using a NOT IN, where it returns true?

But comparison of a value to null does not return false; it returns unknown. And unknown has its own logic:

unknown  AND  true   = unknown
unknown  OR   true   = true
unknown  OR   false  = unknown

One example of how this works out:

where 1 not in (2, null)
--> where 1 <> 2 and 1 <> null
--> where true and unknown
--> where unknown

The where clause only matches on true, so this filters out any row.

You can find the full glory of 3 value logic at Wikipedia.

Andomar
i changed all references involving `AND null` to evaluate to `unknown`. Question still stands.
Ian Boyd