I've got a large query where a simple subquery optimization dropped it from 8 minutes down to 20 seconds. I'm not sure I understand why the optimization had such a drastic effect.

In essence, here's the problem part:

SELECT  (bunch of stuff)
FROM
  a LEFT OUTER JOIN b ON a.ID = b.a
  LEFT OUTER JOIN c ON b.ID = c.b
  ...
  ...
      INNER JOIN veryLargeTable 
      ON a.ID = veryLargeTable.a 
         AND veryLargeTable.PetID = 
             (SELECT id from Pets WHERE Pets.Name = 'Something')    /* BAD! */
  ...
  ...

In all, there are 16 joined tables. If I replace the second predicate of the veryLargeTable join with a pre-populated variable containing the petID (instead of using the subquery), the entire query speeds up dramatically:

AND veryLargeTable.PetID = @petID   /* Awesome! */
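
For completeness, the variable is just populated once before the main query runs, roughly like this:

DECLARE @petID int;   /* assumes Pets.id is an int */
SELECT @petID = id FROM Pets WHERE Name = 'Something';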


Obviously, (SELECT id from Pets WHERE Name = 'Something') is being executed for every row. There are two things I don't fully understand:

  1. As far as I can tell, this is a non-correlated subquery. The Pets table is not part of the outer query at all. Aren't non-correlated subqueries independently evaluated (and hence optimized)? Why isn't this the case here?

  2. The execution plans are dramatically different. In the failure case (above), the entire subtree deals with an estimated 950k rows. In the win case (using a variable instead of a subquery), there's only about 125k estimated rows. What's going on? Why are so many more rows involved if that subquery is there? The Pets.Name column definitely has unique data (but no unique constraint as far as I can tell).

Note that moving the predicate to the WHERE clause doesn't affect the query in either case, as I would expect, since it's an INNER JOIN.

Insights appreciated!

+4  A: 

As an alternative, I think you could eliminate the sub-query with:

...
INNER JOIN veryLargeTable vLT
    ON a.ID = vLT.a 
INNER JOIN Pets p
    ON vLT.PetID = p.id
        AND p.Name = 'Something'
...
Joe Stefanelli
+3  A: 

It has been my experience that the more complex your queries get, the less able the SQL optimizer is to create deft plans. Here you've got 16 joins, some or most of them outer joins, and at least one subquery... toss in enough indexes, cardinalities, views, outer applies, and who knows what else, and no one, not even Microsoft engineers*, can figure out routines that will uniformly and reliably generate the optimal plan.

What you've described, I've experienced numerous times -- change one simple thing in a messy query and everything's an order of magnitude faster (or, gnashes teeth, slower). I have no method for determining when complex is too complex; it's more a feeling than anything else. My general rule of thumb is, if it looks too long or too complex, simplify where you can -- such as your pre-selected single nested value, or breaking out a part of the query that will always run fast with a small result set, running it first, and storing the results in a temp table.
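
For your query, that last idea would look something like this (a rough sketch, reusing the names from the question):

/* run the small, fast part first and materialize it */
SELECT id INTO #petIDs FROM Pets WHERE Name = 'Something';

/* ...then join the temp table instead of nesting the subquery */
  INNER JOIN veryLargeTable
  ON a.ID = veryLargeTable.a
  INNER JOIN #petIDs ON veryLargeTable.PetID = #petIDs.id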

( * Please note that this is mild sarcasm)

Philip Kelley
A: 

I personally think the outcome is not surprising if there's no index on Pets.Name. If you create a unique index on Pets.Name, you are likely to see better results. Without an index, from the server's point of view, the subquery may return multiple rows or NULL. Perhaps the optimiser could do better; it often needs help.
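
Something like this (the index name is just an example):

CREATE UNIQUE NONCLUSTERED INDEX IX_Pets_Name ON Pets (Name);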

vaso
The thought had crossed my mind, but since the subquery is non-correlated, I had assumed it would be evaluated independently. I'll try creating the constraint and see what happens.
womp