views:

112

answers:

4

Which of these queries is more efficient, and would a modern DBMS (like SQL Server) make the changes under the hood to make them equal?

SELECT DISTINCT S# 
  FROM shipments 
 WHERE P# IN (SELECT P# 
                FROM parts 
               WHERE color = ‘Red’)

vs.

SELECT DISTINCT S# 
  FROM shipments, parts 
 WHERE shipments.P# = parts.P# 
   AND parts.color = ‘Red’
+1  A: 
SELECT DISTINCT S# 
FROM shipments,parts 
WHERE shipments.P# = parts.P# and parts.color = ‘Red’;

Using IN forces sql server to not use indexing on that column, and sub quieres are usually slower

george9170
Does using "IN" really force SQL Server to not use indexing? Can you point me to some more reading?
Jeff Meatball Yang
@Jeff I am sorry, It was NOT IN that forces the sql server to not use column indexing on the columns specified in the querythe over all google query i used was sql where sargable sargable - The term "sargable" (which is in effect a made-up word) comes from the pseudo-acronym "SARG", which stands for "S earch ARG ument,"http://www.sql-server-pro.com/sql-where-clause-optimization.htmlhttp://groups.google.com/group/huyuhui/web/a-term-sargable
george9170
George - this approach to the problem is (like the normal inner join) more sensitive to the index on the parts.color field. However, in terms of time complexity, I'm finding no difference between them despite some pretty significant differences in the execution plan. I may just need a bigger dataset but I'm not finding any advantage so far...
Mark Brittingham
Correct (about how it isnt faster), it is just the old school syntax of sql. I would write OMGponies query instead of the on I am using. I cut and pasted what the OP wrote done, as i thought s/he would be more familiar with it.
george9170
+2  A: 

Using IN is more readable, and I recommend using ANSI-92 over ANSI-89 join syntax:

SELECT DISTINCT S#
  FROM SHIPMENTS s
  JOIN PARTS p ON p.p# = s.p#
              AND p.color = 'Red'

Check your explain plans to see which is better, because it depends on data and table setup.

OMG Ponies
+2  A: 

If you aren't selecting anything from the table I would use an EXISTS clause.

SELECT DISTINCT S# 
FROM shipments a
WHERE EXISTS (SELECT 1
              FROM parts b
              WHERE b.color = ‘Red’
                AND a.P# = b.P#)

This will optimize out to be the same as the second one you posted.

HaxElit
Haxelit - you've created a monster now (;-). When I create the execution plan for this query it is identical to that of the *SUBQUERY* and quite different from that of the inner join. It is no faster, either.
Mark Brittingham
BTW - as you'll see above, I made significant changes based on your feedback. Thank you for your comment!
Mark Brittingham
If you wanted to see this thing fly add the following indexesto shipments (P#, S#), and to parts (color, P#). Make sure they are in that order.That would most likely cause a merge join that should fly even on a million record table.
HaxElit
+2  A: 

The best way to satiate your curiosity about this kind of thing is to fire up Management Studio and look at the Execution Plan. You'll also want to look at SQL Profiler as well. As one of my professors said: "the compiler is the final authority." A similar ethos holds when you want to know the performance profile of your queries in SQL Server - just look.

Starting here, this answer has been updated

The actual comparison might be very revealing. For example, in testing that I just did, I found that either approach might yield the fastest time depending on the nature of the query. For example, a query of the form:

Select F1, F2, F3 From Table1 Where F4='X' And UID in (Select UID From Table2)

yielded a table scan on Table1 and a mere index scan on table 2 followed by a right semi join.

A query of the form:

Select A.F1, A.F2, A.F3 From Table1 A inner join Table2 B on (A.UID=B.UID) 
  Where A.Gender='M'

yielded the same execution plan with one caveat: the hash match was a simple right join this time. So that is the first thing to note: the execution plans were not dramatically different.

These are not duplicate queries though since the second one may return multiple, identical records (one for each record in table 2). The surprising thing here was the performance: the subquery was far faster than the inner join. With datasets in the low thousands (thank you Red Gate SQL Data Generator) the inner join was 40 times slower. I was fairly stunned.

Ok, how about a real apples to apples? This is the matching inner join - note the extra step to winnow out the duplicates:

Select Distinct A.F1, A.F2, A.F3 From Table1 A inner join Table2 B 
  on (A.UID=B.UID) 
  Where A.Gender='M'

The execution plan does change in that there is an extra step - a sort after the inner join. Oddly enough, though, the time drops dramatically such that the two queries are almost identical (on two out of five trials the inner join is very slightly faster). Now, I can imagine the first inner join (without the "distinct") being somewhat longer just due to the fact that more data is being forwarded to the query window - but it was only twice as much (two Table2 records for every Table1 record). I have no good explanation why the first inner join was so much slower.

When you add a predicate to the search on table 2 using a subquery:

Select F1, F2, F3 From Table1 Where F4='X' And UID in 
    (Select UID From Table2 Where F1='Y')

then the Index Scan is changed to a Clustered Index Scan (which makes sense since the UID field has its own index in the tables I am using) and the percentage of time it takes goes up. A Stream Aggregate operation is also added. Sure enough, this does slow the query down. However, plan caching obviously kicks in as the first run of the query shows a much greater effect than subsequent runs.

When you add a predicate using the inner join, the entire plan changes pretty dramatically (left as an exercise to the reader - this post is long enough). The performance, again, is pretty much the same as that of the subquery - as long as the "Distinct" is included. Similar to the first example, omitting distinct led to a significant increase in time to completion.

One last thing: someone suggested (and your question now includes) a query of the form:

Select Distinct F1, F2, F3 From table1, table2
Where (table1.UID=table2.UID) AND table1.F4='X' And table2.F1='Y'

The execution plan for this query is similar to that of the inner join (there is a sort after the original table scan on table2 and a merge join rather than a hash join of the two tables). The performance of the two is comparable as well. I may need a larger dataset to tease out difference but, so far, I'm not seeing any advantage to this construct or the "Exists" construct.

With all of this being said - your results may vary. I came nowhere near covering the full range of queries that you may run into when I was doing the above tests. As I said at the beginning, the tools included with SQL Server are your friends: use them.

So: why choose one over the other? It really comes down to your personal preferences since there appears to be no advantage for an inner join to a subquery in terms of time complexity across the range of examples I tests.

In most classic query cases I use an inner join just because I "grew up" with them. I do use subqueries, however, in two situations. First, some queries are simply easier to understand using a subquery: the relationship between the tables is manifest. The second and most important reason, though, is that I am often in a position of dynamically generating SQL from within my application and subqueries are almost always easier to generate automatically from within code.

So, the takeaway is simply that the best solution is the one that makes your development the most efficient.

Mark Brittingham
What is the Execution Plan? Is that in addition to the query writer, where you type out your SQL? I'm a little new Management Studio, so I'm not sure where everything is...
Nate Bross
Nate, create a query in a new query window in SSMS (management studio). Pull down the "Query" menu and select "Display Estimated Execution Plan" (you'll see it is Ctrl-L). That's all there is to it. One note though: I am using SQL Server 2008, Developer's Edition. I remember a similar command in SQL Server 2005 but it has been awhile so I'm not sure if the menus have changed. If you want more information about what you see there, there are quite a few books and web sites covering it. I have a specific book to recommend but it is at home. I'll try to get back later to give you the name.
Mark Brittingham
Slightly misleading. EXISTS/NOT EXISTS is *consistently* most efficient. SQL Server will probably generate the same plan for IN/EXISTS. NOT IN/NOT EXISTS are different. (OUTER) JOIN often. requires a distinct. Ergo , always use EXISTS/NOT EXISTS
gbn
"Subqueries are generally considered to be slower as the results must be collected before the comparison." The results must not be collected before the comparison. The optimizer is smart enough to know that it doesn't have to select the entire data set inside the sub-select before it can join onto it. Sub-selects are just a nice way to logically group operations, they don't always materialize all their data. So for one being faster then the other, it all depends on how it's optimized.
HaxElit
haxelit - when I first answered this question I had gone into the SQL Server and run a query using both type of constructs and a table scan was prominant among the operations needed when using a subquery. So when I saw your comment I thought you were A) wrong and B) splitting hairs. Before jumping to conclusions, though, I went back and did more testing and have to admit that you make a good point. I am changing my answer and alerting Nate in case he wants to select a different answer as "the answer". Thanks for the feedback.
Mark Brittingham
@Mark: Thank you for taking the time to explore this. Having grown up with INNER JOINs I just assumed (pun intended) that they were faster than sub queries. I'm usually the guy who says "test it!" in order to determine the truth of the matter so this helped bring out one of my own blind spots.
Chris Lively
Thank you for such a detailed response! I'm def going to checkout the tools in the query builder, thanks!
Nate Bross
@Chris - actually thanks should go to haxelit. I had made some assumptions as well based on an answer that I saw elsewhere on SO and just passed them on. It was haxelit that lit the fire under my butt to really check the assumptions in detail.
Mark Brittingham