views:

158

answers:

2

I'm noticing something a bit unexpected with how SQL Server (SQL Server 2008 in this case) treats correlated subqueries within a select statement. My assumption was that a query plan should not be affected by the mere order in which subqueries (or columns, for that matter) are written within the projection clause of the select statement. However, this does not appear to be the case.

Consider the following two queries, which are identical except for the ordering of the subqueries within the CTE:

--query 1: subquery for Color is second
WITH vw AS
(
 SELECT p.[ID],
  (SELECT TOP(1) [FirstName] FROM [Preference] WHERE p.ID = ID AND [FirstName] IS NOT NULL ORDER BY [LastModified] DESC) [FirstName],
  (SELECT TOP(1) [Color] FROM [Preference] WHERE p.ID = ID AND [Color] IS NOT NULL ORDER BY [LastModified] DESC) [Color]
 FROM Person p
)
SELECT ID, Color, FirstName
FROM vw
WHERE Color = 'Gray';


--query 2: subquery for Color is first
WITH vw AS
(
 SELECT p.[ID],
  (SELECT TOP(1) [Color] FROM [Preference] WHERE p.ID = ID AND [Color] IS NOT NULL ORDER BY [LastModified] DESC) [Color],
  (SELECT TOP(1) [FirstName] FROM [Preference] WHERE p.ID = ID AND [FirstName] IS NOT NULL ORDER BY [LastModified] DESC) [FirstName]
 FROM Person p
)
SELECT ID, Color, FirstName
FROM vw
WHERE Color = 'Gray';

If you look at the two query plans, you'll see that an outer join is used for each subquery and that the order of the joins is the same as the order the subqueries are written. There is a filter applied to the result of the outer join for color, to filter out rows where the color is not 'Gray'. (It's odd to me that SQL would use an outer join for the color subquery since I have a non-null constraint on the result of the color subquery, but OK.)

Most of the rows are removed by the color filter. The result is that query 2 is significantly cheaper than query 1 because fewer rows are involved with the second join. All reasons for constructing such a statement aside, is this an expected behavior? Shouldn't SQL server opt to move the filter as early as possible in the query plan, regardless of the order the subqueries are written?

Edit: Just to clarify, there is a valid reason I'm exploring this scenario. I may need to create a view that involves similarly constructed subqueries, and it is now apparent that any filtering based on these columns projected from the view will vary in performance just because of the ordering of the columns!

+2  A: 

With the TOP operator coming into play here, the Query Optimizer is remarkably blind about the statistics, so it will look for other clues about how best to work it, such as instantiating relevant parts of the CTE first.

And it's an outer join because the subquery will be used as NULL if nothing is returned, and the system is instantiating it first. If you were using an aggregate instead of TOP, you'd probably get a slightly different but more consistent plan.

Rob Farley
You may prefer to use row_number to get your result, as it can be an effective replacement for TOP.
Rob Farley
Yes, I understand your point on statistics, but shouldn't the optimizer make the assumption that any filtering at worst leaves all rows, but at best removes many of the rows? Also, if you construct a more typical query using an outer join and then put a non-null constraint on a joined column, the optimizer is smart enough to use an inner join. Why not in this case?
Michael Petito
Also I can't easily use aggregation in this case because there is no null coalescing aggregate in SQL.
Michael Petito
A scalar subquery in the select clause can never filter the set of data, so therefore it's implemented as an outer join. The fact that you're filtering it elsewhere means that it could potentially filter out NULLs first by doing an inner join, but there's no advantage to doing this, and the QO chooses to use an outer join (which is generally quicker, as the system doesn't have to delete rows from its working set).
Rob Farley
I know you can't use aggregates in this particular case. It would be nice if T-SQL had implemented a FIRST(Field) OVER (ORDER BY ...) function, but it doesn't have that yet.
Rob Farley
Yes, a FIRST windowed function would obviate the need for subqueries or joins entirely and probably perform best. I wonder if that will ever become part of T-SQL?
Michael Petito
+1  A: 

Here is an alternate version that might perform better:

With Colors As
    (
    Select Id, [Color]
        , ROW_NUMBER() OVER ( PARTITION BY ID ORDER BY [LastModified] DESC ) As Num
    From Preference
    Where [Color] Is Not Null
    )
    , Names As
    (
    Select Id, [FirstName]
        , ROW_NUMBER() OVER ( PARTITION BY ID ORDER BY [LastModified] DESC ) As Num
    From Preference
    Where [FirstName] Is Not Null
    )
Select
From Person As P
    Join Colors As C
        On C.Id = P.Id
            And C.Num = 1
    Left Join Names As N
        On N.Id = P.Id
            And N.Num = 1
Where C.[Color]= 'Grey'

Another solution which is more concise but may or may not perform as well:

With RankedItems
    (
    Select Id, [Color], [FirstName]
        , ROW_NUMBER() OVER ( PARTITION BY ID ORDER BY Case When [Color] Is Not Null 1 Else 0 End DESC, [LastModified] DESC ) As ColorRank
        , ROW_NUMBER() OVER ( PARTITION BY ID ORDER BY Case When [FirstName] Is Not Null 1 Else 0 End DESC, [LastModified] DESC ) As NameRank
    From Preference
    )
Select
From Person As P
    Join RankedItems As RI
        On RI.Id = P.Id
            And RI.ColorRank = 1
    Left Join RankedItems As RI2
        On RI2.Id = P.Id
            And RI2.NameRank = 1
Where RI.[Color]= 'Grey'
Thomas
Thanks Thomas, I'm examining your first query now and it looks pretty good. So far it seems the generated plan is both independent of order and the joins that are filtered on are always performed first and as inner joins.
Michael Petito
Your first query also seems to scale better than my original query in terms of IO performance since SQL scans for subsequent joins instead of using repeated index seeks. The second query is more concise (though not as clear), but unfortunately it requires expensive sorts on the result of the case statements. I can easily avoid the sorting required in the first query by clustering on ID, LastModified DESC.
Michael Petito