views:

887

answers:

7

I've heard lots of people saying that the IN keyword in most relational databases is slow. How true is this? An example query would be this, off the top of my head:

SELECT * FROM someTable WHERE someColumn IN (value1, value2, value3)

I've heard that is much slower than doing this:

SELECT * FROM someTable WHERE
  someColumn = value1 OR
  someColumn = value2 OR
  someColumn = value3

Is this true? Or is the speed difference negligible? If it matters, I'm using PostgreSQL, but I'd also like to know how MySQL fares (and if it's any different). Thanks in advance.

A: 

It says in the docs that IN is very fast in MySQL but I can't find the source at the moment.

Greg
I also recall reading in the docs that the IN operator is well optimized in mysql
ʞɔıu
A: 

The speed of the IN keyword would really depend on the complexity of your subquery. In the example you provide you just want to see if someColumns value is in a set list of values, and a pretty short one at that. So I would imagine that the performance cost would be very minimal in that case.

Matthew Vines
I edited my question to show what I'm comparing against.
musicfreak
+1  A: 

IN with a subselect is often slow. IN with a value list shouldn't be any slower than someColumn = value1 OR someColumn = value2 OR someColumn = value3, etc. That is plenty fast, as long as the number of values is sane.

IN with a subquery is slow when the optimizer can't figure out a good way to perform the query, and has to use the obvious method of building the full result of the subquery. For example:

SELECT username
  FROM users
  WHERE userid IN (
    SELECT userid FROM users WHERE user_first_name = 'Bob'
  )

is going to be much slower than

SELECT username FROM users WHERE user_first_name = 'Bob'

unless the optimizer can figure out what you meant.

derobert
+6  A: 

In PostgreSQL, exactly what you'll get here depends on the underlying table, so you should use EXPLAIN ANALYZE on some sample queries against a useful subset of your data to figure out exactly what the optimizer is going to do (make sure the tables you're running against have been ANALYZEd too). IN can be processed a couple of different ways, and that's why you need to look at some samples to figure out which alternative is being used for your data. There is no simple generic answer to your question.

As for the specific question you added in your revision, against a trivial data set with no indexes involved here's an example of the two query plans you'll get:

postgres=# explain analyze select * from x where s in ('123','456');
 Seq Scan on x  (cost=0.00..84994.69 rows=263271 width=181) (actual time=0.015..1819.702 rows=247823 loops=1)
   Filter: (s = ANY ('{123,456}'::bpchar[]))
 Total runtime: 1931.370 ms

postgres=# explain analyze select * from x where s='123' or s='456';
 Seq Scan on x  (cost=0.00..90163.62 rows=263271 width=181) (actual time=0.014..1835.944 rows=247823 loops=1)
   Filter: ((s = '123'::bpchar) OR (s = '456'::bpchar))
 Total runtime: 1949.478 ms

Those two runtimes are essentially identical, because the real processing time is dominated by the sequential scan across the table; running multiple times shows the difference between the two is below the run to run margin of error. As you can see, PostgreSQL transforms the IN case into using its ANY filter, which should always execute faster than a series of ORs. Again, this trivial case is not necessarily representative of what you'll see on a serious query where indexes and the like are involved. Regardless, manually replacing INs with a series of OR statements should never be faster, because the optimizer knows the best thing to do here if it has good data to work with.

In general, PostgreSQL knows more tricks for how to optimize complicated queries than the MySQL optimizer does, but it also relies heavily on your having given the optimizer enough data to work with. The first links on the "Performance Optimization" section of the PostgreSQL wiki covers the most important things needed to get good results from the optimizer.

Greg Smith
+3  A: 

In MySQL, these are complete synonyms for the optimizer:

SELECT  *
FROM    someTable
WHERE   someColumn IN (value1, value2, value3)

and

SELECT  *
FROM    someTable
WHERE   someColumn = value1 OR
        someColumn = value2 OR
        someColumn = value3

, provided that value's are literal contants or preset variables.

According to the documentation:

The definition of a range condition for a single-part index is as follows:

  • For both BTREE and HASH indexes, comparison of a key part with a constant value is a range condition when using the =, <=>, IN(), IS NULL, or IS NOT NULL operators.
  • For all types of indexes, multiple range conditions combined with OR or AND form a range condition.

“Constant value” in the preceding descriptions means one of the following:

  • A constant from the query string
  • A column of a const or system table from the same join
  • The result of an uncorrelated subquery
  • Any expression composed entirely from subexpressions of the preceding types

However, this query:

SELECT  *
FROM    table
WHERE   id = 1
        OR id = (SELECT id FROM other_table WHERE unique_condition)

will use the index on id, while this one:

SELECT  *
FROM    table
WHERE   id IN (1, (SELECT id FROM other_table WHERE unique_condition))

will use fullscan.

I. e. there is difference when one of the value's is a single-row subquery.

I've filed it recently as bug 45145 in MySQL (it turned out to be 5.2 specific, absent in 5.1 and corrected in 6.0)

Quassnoi
+1  A: 

Using IN isn't necessarily slow, it's how you build the IN parameters that will slow things down considerably. Too often people use SELECT ... WHERE x IN (SELECT..., which can be very poorly optimized (i.e. not at all). Do a search on "correlated subquery" to see how bad it can get.

Often you don't have to use IN at all and can use a JOIN instead, and take advantage of derived tables.

SELECT * FROM table1 WHERE x IN (SELECT y FROM table2 WHERE z=3)

Can be rephrased like this

SELECT * FROM table1 JOIN (SELECT y FROM table2 WHERE z=3) AS table2 ON table1.x=table2.y

If the IN syntax is slow, the JOIN syntax will often times be much faster. You can use EXPLAIN to see how each query would be optimized differently. This is a simplistic example and your database may show the same query path, but more complicated queries usually show something different.

Brent Baisley
+1  A: 

I think you got the answer(s) you wanted above. Just wanted to add one thing.

You need to optimize IN and use it the right way. In development, I always set up a debug section at the bottom of the page anytime there is a query and it automatically runs an EXPLAIN EXTENDED on every SELECT and then SHOW WARNINGS in order to see the (likely) way that MySQL's Query Optimizer will rewrite the query internally. Lots to learn from that on how to make sure IN is working for you.

joedevon