ansaurus

Question

Answer 1

+1 A:

I would use a join, betting that it'll be a heck of a lot faster than IN. This presumes that there are primary keys defined, of course, thus letting indexing speed things up tremendously.

Alan 2009-06-16 13:41:00

Answer 2

+3 A:

The IN is evaluated (and the select from b re-run) for each row in a, whereas the JOIN is optimized to use indices and other neat paging tricks...

In most cases, though, the optimizer would likely be able to construct a JOIN out of a correlated subquery and end up with the same execution plan anyway.

Edit: Kindly read the comments below for further... discussion about the validity of this answer, and the actual answer to the OP's question. =)

J. Steen 2009-06-16 13:41:29

Does EXISTS behave in the same manner as IN?

Rich.Carpenter 2009-06-16 13:43:21

@JSTeen: No. The IN is a sequential "OR". @Rich: No ,EXISTS is a form of JOIN but gives more correct results

gbn 2009-06-16 13:46:32

This answer is a big over-simplification. In a large number of cases, the optimizer will rephrase the IN as a JOIN, and it'll be just as fast as if you'd written the join yourself.

skaffman 2009-06-16 13:47:55

@skaffman is right, I have never seen a bogus plan like this for IN. It is always a join in my experience.

2009-06-16 13:49:14

Trusting the optimizer is a fine thing, but you generally need to know a lot more about how and what the optimizer accomplishes with your query. Which is why I advocate using the method that's safer, in more cases. Yes? =)

J. Steen 2009-06-16 13:50:38

@J. Steen: This is almost always wrong.

Quassnoi 2009-06-16 13:51:07

However, I won't refuse to admit I'm wrong, if someone could point to some sort of article that explains the pitfalls? =)

J. Steen 2009-06-16 13:54:11

SQL Server BOL? http://msdn.microsoft.com/en-us/library/ms188336.aspxIn this case, EXISTS deals with NULL: Give me 5 to find more...

gbn 2009-06-16 13:56:52

... and http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx

gbn 2009-06-16 13:57:26

@J. Steen: http://explainextended.com/2009/04/20/counting-missing-rows-sql-server/ for SQL Server, and follow the links for other systems.

Quassnoi 2009-06-16 13:57:33

My MVP collegues has reminded me that IN can match a JOIN or EXISTS if the columns are *not* indexed. EXISTS and JOINS like indexes, usually, but IN is still lots of OR statements

gbn 2009-06-16 14:01:08

@gbn, Right, EXISTS is superior to IN because of how it handles expected and returned results, fine. Does it outperform JOINs, though? I'm curious - if it does, I'll have to reevaluate. =)

J. Steen 2009-06-16 14:03:44

@gbn: if there are not indexes, HASH JOIN / HASH SEMI JOIN will be chosen for JOIN / IN, appropriately. If the values of b.d are unique, both these methods will return exactly same results in exactly same time. If they're not, the JOIN and the IN queries will return different resultsets. IN resultset will of course be shorter and will be returned FASTER than the JOIN one.

Quassnoi 2009-06-16 14:05:33

@J. Steen: EXISTS will perform EXACTLY as fast as an IN query if there is an index, and MUCH slower if there is no index. It's EXISTS query that needs to be reevaluated for each for. For an IN query, a hash table will be built.

Quassnoi 2009-06-16 14:07:23

@Quassnoi, Oho! And that hashtable will then be used in subsequent hits of the subquery, I see!

J. Steen 2009-06-16 14:08:36

@J. Steen: all methods that involve an IN clause are called SEMI JOIN's. There are NESTED LOOP SEMI JOIN, MERGE SEMI JOIN and HASH SEMI JOIN, neither of them reevaluates the whole subquery for each row. It's in fact query optimization 101 :)

Quassnoi 2009-06-16 14:11:12

@Quassnoi, remind me to spank my former DBA colleague, then. He quite obviously knew nothing of this and lured me into corruption!

J. Steen 2009-06-16 14:13:24

@J. Steen: what you said initially is true for a non-indexed table in MySQL, since it's not capable of doing HASH JOIN's, but it's certainly not true for SQL Server. SQL Server could make HASH JOIN's since version 6.5 (it's when I started to use it), and probably even earlier.

Quassnoi 2009-06-16 14:20:52

@Quassnoi, well, that makes sense. The DBA started his career with MySQL to later convert to MSSQL.

J. Steen 2009-06-16 14:22:22

@Quassnoi, re: IN vs EXISTS - iirc this is the opposite situation to SQL2000 (which I learned SQL on), can you point me at a source for this?

annakata 2009-06-16 14:37:40

@annakata: EXISTS is more flexible than IN: any IN can be converted to an EXISTS, but not vice versa. For simple cases, when EXISTS can be converted into IN, in will be converted into some kind of a SEMI JOIN, and will perform as fast as an IN. For complex non-sargable and non-hashable cases, like EXISTS (SELECT 1 FROM subtable WHERE FUNCTION(subtablecol1,outertable.col1) = 1), there is no other way but to evaluate the subquery for each pair of records).

Quassnoi 2009-06-16 14:50:26

@Quassnoi: My confusion stems from my expectation that Exists was always faster or equivalent to IN, not the other way around, but my SQL-foo is old and weak. Blog more on this please :)

annakata 2009-06-16 15:05:38

@annakata: once a day, as usual :)

Quassnoi 2009-06-16 15:07:42

Answer 3

+2 A:

Aside from going and actually testing it out on a big swath of test data for yourself, I would say use the JOINS. I've always had better performance using them in most cases compared to an IN subquery, and you have a lot more customization options as far as how to join, what is selected, what isn't, etc.

TheTXI 2009-06-16 13:42:41

Answer 4

+2 A:

They are different queries with different results. With the IN query you will get 1 row from table 'a' whenever the predicate matches. With the INNER JOIN query you will get a*b rows whenever the join condition matches. So with values in a of {1,2,3} and b of {1,2,2,3} you will get 1,2,2,3 from the JOIN and 1,2,3 from the IN.

EDIT - I think you may come across a few answers in here that will give you a misconception. Go test it yourself and you will see these are all fine query plans:

create table t1 (t1id int primary key clustered)
create table t2 (t2id int identity primary key clustered
    ,t1id int references t1(t1id)
)


insert t1 values (1)
insert t1 values (2)
insert t1 values (3)
insert t1 values (4)
insert t1 values (5)

insert t2 values (1)
insert t2 values (2)
insert t2 values (2)
insert t2 values (3)
insert t2 values (4)


select * from t1 where t1id in (select t1id from t2)
select * from t1 where exists (select 1 from t2 where t2.t1id = t1.t1id)
select t1.* from t1 join t2 on t1.t1id = t2.t1id

The first two plans are identical. The last plan is a nested loop, this difference is expected because as I mentioned above the join has different semantics.

2009-06-16 13:42:52

+1. Indeed, the two queries would give different results.

Sorin Comanescu 2009-06-16 14:01:38

Answer 5

+3 A:

Neither. Use an ANSI-92 JOIN:

SELECT a.*
FROM a JOIN b a.c = b.d

However, it's best as an EXISTS

SELECT a.*
FROM a
WHERE EXISTS (SELECT * FROM b WHERE a.c = b.d)

This remove the duplicates that could be generated by the JOIN, but runs just as fast if not faster

gbn 2009-06-16 13:43:16

That sub-select is a correlated sub-select. Whether the optimizer handles that efficiently is debatable - usually, they are bad for performance.

Jonathan Leffler 2009-06-16 22:19:05

Assuming indexed columns, I've don't recall having a problem. Plan for plan, it works out similar on our boxes.

gbn 2009-06-17 04:11:39

Answer 6

+12 A:

Update:

This article in my blog summarizes both my answer and my comments to another answers, and shows actual execution plans:

IN vs. JOIN vs. EXISTS

SELECT  *
FROM    a
WHERE   a.c IN (SELECT d FROM b)

SELECT  a.*
FROM    a
JOIN    b
ON      a.c = b.d

These queries are not equivalent. They can yield different results if your table b is not key preserved (i. e. the values of b.d are not unique).

The equivalent of the first query is the following:

SELECT  a.*
FROM    a
JOIN    (
        SELECT  DISTINCT d
        FROM    b
        ) bo
ON      a.c = bo.d

If b.d is UNIQUE and marked as such (with a UNIQUE INDEX or UNIQUE CONSTRAINT), then these queries are identical and most probably will use identical plans, since SQL Server is smart enough to take this into account.

SQL Server can employ one of the following methods to run this query:

If there is an index on a.c, d is UNIQUE and b is relatively small compared to a, then the condition is propagated into the subquery and the plain INNER JOIN is used (with b leading)
If there is an index on b.d and d is not UNIQUE, then the condition is also propagated and LEFT SEMI JOIN is used. It can also be used for the condition above.
If there is an index on both b.d and a.c and they are large, then MERGE SEMI JOIN is used
If there is no index on any table, then a hash table is built on b and HASH SEMI JOIN is used.

Neither of these methods reevaluates the whole subquery each time.

See this entry in my blog for more detail on how this works:

Counting missing rows: SQL Server

There are links for all RDBMS's of the big four.

Quassnoi 2009-06-16 13:45:41

Thanks for the detailed explanation of what's going on behind the scenes.

macleojw 2009-06-16 14:42:47

Answer 7

+1 A:

Observe the execution plan for both types and draw your conclusions. Unless the number of records returned by the subquery in the "IN" statement is very small, the IN variant is almost certainly slower.

thijs 2009-06-16 13:55:32

Answer 8

+1 A:

From MSDN documentation on Subquery Fundamentals:

Many Transact-SQL statements that include subqueries can be alternatively formulated as joins. Other questions can be posed only with subqueries. In Transact-SQL, there is usually no performance difference between a statement that includes a subquery and a semantically equivalent version that does not. However, in some cases where existence must be checked, a join yields better performance. Otherwise, the nested query must be processed for each result of the outer query to ensure elimination of duplicates. In such cases, a join approach would yield better results.

In the example you've provided, the nested query need only be processed a single time for each of the outer query results, so there should be no performance difference. Checking the execution plans for both queries should confirm this.

Note: Though the question itself didn't specify SQL Server 2005, I answered with that assumption based on the question tags. Other database engines (even different SQL Server versions) may not optimize in the same way.

iammichael 2009-06-16 14:08:04

Answer 9

A:

It's generally held that a join would be more efficient than the IN subquery; however the SQL*Server optimizer normally results in no noticeable performance difference. Even so, it's probably best to code using the join condition to keep your standards consistent. Also, if your data and code ever needs to be migrated in the future, the database engine may not be so forgiving (for example using a join instead of an IN subquery makes a huge difference in MySql).

DBMarcos99 2009-06-16 15:10:06

Answer 10

A:

Theory will only get you so far on questions like this. At the end of the day, you'll want to test both queries and see which actually runs faster. I've had cases where the JOIN version took over a minute and the IN version took less than a second. I've also had cases where JOIN was actually faster.

Personally, I tend to start off with the IN version if I know I won't need any fields from the subquery table. If that starts running slow, I'll optimize. Fortunately, for large datasets, rewriting the query makes such a noticeable difference that you can simply time it from Query Analyzer and know you're making progress.

Good luck!

Jason Kester 2009-06-16 18:38:31

Answer 11

+1 A:

Speaking from experience on a Table with 49,000,000 rows I would recommend LEFT OUTER JOIN. Using IN, or EXISTS Took 5 minutes to complete where the LEFT OUTER JOIN finishes in 1 second.

SELECT a.*
FROM a LEFT OUTER JOIN b ON a.c = b.d
WHERE b.d is not null -- Given b.d is a primary Key with index

Actually in my query I do this across 9 tables.

Theofanis Pantelides 2009-06-18 14:48:37

Answer 12

A:

Ive always been a supporter of the IN methodology. This link contains details of a test conducted in PostgresSQL. http://archives.postgresql.org/pgsql-performance/2005-02/msg00327.php

2009-07-04 18:18:41

Answer 13

A:

I've done a comparison on this:

See Oracle vs MySQL vs SQL Server: Aggregation vs Joins

cletus 2009-07-29 13:44:37

ansaurus

tags:

views:

answers:

IN vs. JOIN with large rowsets

related questions