views:

85

answers:

4

I am trying to decide on which approach to take in a database I am designing. I will be adding a ProcessedDate datetime null column in a table. It will be nullable for when the record has not been processed. So, is it worth having a Processed bit not null default 0 column as well?

With the following queries:

select * from tablename where ProcessedDate is null

and

select * from tablename where Processed = 0

All things being equal*, what is the performance difference between the two versions?

*: Appropriate indices are applied to the table in each version. I am not looking for advice on what indices to create. I only want information about the performance of the filter as applied to a single row. If all rows in the table need to be scanned, or a seek is done is irrelevant to the question at hand.

I know that an argument could be made that having the Processed flag is more explicit and therefore more readable, however it also raises issues around keeping the columns in sync (which could be handled by using a computed column). Anyway, I want to limit the scope of this question to the performance side of things.

+2  A: 

You should not have both.

Pick the one that matches your domain model best.

With the right index, any performance difference will negligible (assuming that the relative distribution of 'processed' to 'not processed' means the optimizer will pick an index in the first place).

Mitch Wheat
+2  A: 

As with most SQL performance questions, the answer will be driven by your table schema, not by your SQL text. As it is right now, both version require a full table scan, so they are equaly bad.

A non clustered index on ProcessedDate may help the first version, but since you ask for * in the projection, the index tipping point may kick in and still hit a full scan.

On the other hand a clustered index with the leftmost key Processed will always work, and such a clustered index structure is common in queue processing (and having a 'ProcessedDate' column is a clear give away that your table is used as a queue).

But then a leftmost column on ProcessedDate for the clustered index would also work just like the Processed bit.

The lesson to take home: is not the query, is the schema. Always.

Remus Rusanu
@Remus Rusanu: +1. very nice answer. Was just checking your web site. Some very useful info...
Mitch Wheat
I'd say it is a good answer, expect it is a good answer to a different question. I've asked a specific question about the performance of two clauses, and received an entry level overview of indexing. I asked the specific question because I haven't seen anything about it elsewhere, and next time someone else is wondering about it, all they'll see is this answer, which doesn't help.How about a followup? Do either of you know the answer to the question I asked?
David
+3  A: 

They are identical as far as I can tell with equal sets of data. You can run my example below and look at the execution plans to check. You'll see that the filter element where it tests the predicate takes up the same proportion of time in each. Even if indexes are added it wouldn't make one more efficient than the other as they are basically doing the same thing.


WITH Test (MyInt, MyNull)
AS
(
SELECT 1 AS MyInt, Null AS MyNull

UNION ALL

SELECT 
    MyInt + 1,
    CASE
        WHEN MyInt % 2 = 0 THEN Null
        ELSE MyInt
    END
FROM Test
WHERE MyInt < 10000
)

SELECT * FROM Test
WHERE MyNull IS NULL
OPTION (MAXRECURSION 32767);

WITH Test (MyInt, MyBit)
AS
(
SELECT 1 AS MyInt, 0 AS MyBit

UNION ALL

SELECT 
    MyInt + 1,
    CASE
        WHEN MyInt % 2 = 0 THEN 0
        ELSE 1
    END
FROM Test
WHERE MyInt < 10000
)

SELECT * FROM Test
WHERE MyBit = 0
OPTION (MAXRECURSION 32767);

I know my example is contrived but it uses equal numbers of nulls and 0's. Even if you re-write the CTE for each example into table variables you should see the same performance.

Joe Swan
+1  A: 

One advantage of the ProcessedDate is null predicate is that you don't risk getting bitten by (explicit or implicit) parametrization.

If you use where Processed = 0, SQL server might internally parametrize this to Processed = @p, and the generate the query plan based on the assumption that @p = 0. However, if you somewhere do the same query but where Processed = 1, you might get a plan for @p = 1, and since the data is likely to be skewed, it's likely that the plans won't be the same.

erikkallen
So does this mean that there is potentially extra overhead in the second execution plan generation for the bit version? Does the null-checking version generate one execution plan that is larger? I'm not sure I see what would happen if the query were run with ProcessedDate is not null.
David
IS NULL / IS NOT NULL cannot be parameterized. But don't care whether the plan is "larger" (whatever that means) or whether it needs to be regenerated. Of all things that take time in a database call, query plan generation is not one of them.
erikkallen