ansaurus

Question

Is WHERE ID IN (1, 2, 3, 4, 5, ...) the most efficient?

Answer 1

+6 A:

Well, it depends on what's really going on. How is the user choosing these IDs?

Also, it's not just efficiency; there's also security and correctness to worry about. When and how does the user tell the database about their ID choices? How do you incorporate them into the query?

It might be much better to put the selected IDs into a separate table that you can join against (or use a WHERE EXISTS against).

I'll give you that you're not likely to do much better performance-wise than IN (1,2,3..n) for a small (user-generated) n. But you need to think about how you generate that query. Are you going to use dynamic SQL? If so, how will you secure it from injection? Will the server be able to cache the execution plan?

Also, using an extra table is often just easier. Say you're building a shopping cart for an eCommerce site. Rather than worrying up keeping track of the cart client side or in a session, it's likely better to update the ShoppingCart table every time the user makes a selection. This also avoids the whole problem of how to safely set the parameter value for your query, because you're only making one change at a time.

Don't forget to old adage (with apologies to Benjamin Franklin):

He who would trade correctness for performance deserves neither

Joel Coehoorn 2009-10-05 20:12:34

We are building a UI in which it will be very simple to select really an arbitrary set. But I’m guessing that in many cases the actual set will be actually several intervals anyway. So we could potentially build the quires as: (1 <= ID AND ID <= 8) OR (38 <= ID AND ID <= 89) ... It’s worth considering this approach. Thanks.

Jan Zich 2009-10-05 20:36:11

Answer 2

+6 A:

Be careful; on many databases, IN (...) is limited to a fixed number of things in the IN clause. For example, I think it's 1000 in Oracle. That's big, but possibly worth knowing.

Dean J 2009-10-05 20:13:21

Thanks for pointing me to this. I’ll definitely investigate it.

Jan Zich 2009-10-05 20:38:30

I just tested it with about 20000 items in the IN clause, and it worked. However, when I tested it with 100000 items, SQL Server was thinking for a long time and then said: “The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query.” So there is no explicit limit per se, but the query processor chokes on it.

Jan Zich 2009-10-06 07:41:54

Answer 3

+2 A:

A table var has issues: using a temp table with index has benefits for statistics.

A table var is assumed to always have one row, whereas a temp table has stats the optimiser can use.

Parsing a CSV is easy: see questions on right...

gbn 2009-10-05 20:14:39

to parse a CSV string: http://stackoverflow.com/questions/1456192/comparing-a-column-to-a-list-of-values-in-t-sql/1456404#1456404

KM 2009-10-05 20:25:46

Answer 4

A:

For many years I use 3 approach but when I start using OR/M it's seems to be unnecessary.

Even loading each row by id is not as much inefficient like it looks like.

dario-g 2009-10-05 20:15:16

Answer 5

A:

If problems with string manipulation are putted aside, I think that:

WHERE ID=1 OR ID=2 OR ID=3 ...

is more efficient, nevertheless I wouldn't do it.

You could compare performance between both approaches.

Juparave 2009-10-05 20:19:25

No. IN beats OR, because the DB can optimize for that set (basically treat it like a small table).

Joel Coehoorn 2009-10-05 20:24:11

Why do you think that's more efficient than an IN clause?

Eric J. 2009-10-05 23:16:53

After some tests in a 20k rows table 'EXPLAIN SELECT * from product where id in (1,2,3,4,5);' is exactly the same as 'EXPLAIN SELECT * from product where id=1 or id=2 or id=3 or id=4 or id=5;' so I guess I was wrong

Juparave 2009-10-05 23:43:01

Answer 6

+1 A:

Essentially, I would agree with your observation; SQL Server's optimizer will ultimately pick the best plan for analyzing a list of values and it will typically equate to the same plan, regardless of whether or not you are using

WHERE IN

or

WHERE EXISTS

or

JOIN someholdingtable ON ...

Obviously, there are other factors which influence plan choice (like covering indexes, etc). The reason that people have various methods for passing in this list of values to a stored procedure is that before SQL 2008, there really was no simple way of passing in multiple values. You could do a list of parameters (WHERE IN (@param1, @param2)...), or you could parse a string (the method you show above). As of SQL 2008, you can also pass table variables around, but the overall result is the same.

So yes, it doesn't matter how you get the list of variables to the query; however, there are other factors which may have some effect on the performance of said query once you get the list of variables in there.

Stuart Ainsworth 2009-10-05 20:23:24

Answer 7

+3 A:

The IN clause does not guaranties a INDEX SEEK. I faced this problem before using SQL Mobile edition in a Pocket with very few memory. Replacing IN (list) with a list of OR clauses boosted my query by 400% aprox.

Another approach is to have a temp table that stores the ID's and join it against the target table, but if this operation is used too often a permanent/indexed table can help the optimizer.

Rodrigo 2009-10-05 20:23:50

Answer 8

A:

To answer the question directly, there is no way to pass a (dynamic) list of arguments to an SQL Server 2005 procedure. Therefore what most people do in these cases is passing a comma-delimited list of identifiers, which I did as well.

Since sql 2005 though I prefer passing and XML string, which is also very easy to create on a client side (c#, python, another SQL SP), and "native" to work with since 2005:

CREATE PROCEDURE myProc(@MyXmlAsSTR NVARCHAR(MAX)) AS BEGIN
    DECLARE @x XML
    SELECT @x = CONVERT(XML, @MyXmlAsSTR)

Then you can join your base query directly with the XML select as (not tested):

SELECT      t.*
FROM        myTable t
INNER JOIN  @x.nodes('/ROOT/ROW') AS R(x)
        ON  t.ID = x.value('@ID', 'INTEGER')

when passing <ROOT><ROW ID="1"/><ROW ID="2"/></ROOT>. Just remember that XML is CaSe-SensiTiv.

van 2009-10-05 20:43:05

Really not sure this is a very efficient solution. Interesting maybe, but not efficient

Scott Ferguson 2009-10-05 20:50:01

I agree. I does not look as efficient, but it looks quite elegant. You don't make your hands dirty with string manipulation (at least not directly).

Jan Zich 2009-10-05 21:06:41

What do you mean "does not look as efficient"? have you tested it?Given the native support for XML datatype since sql-2005 (much improved since sql2k, btw), I seriously doubt that string-splitting functions that most people use is really "more" efficient.But tastes differ, I guess. As do **prejudices** (some think **slow** as soon as they see **xml**).

van 2009-10-06 05:30:31

Sorry, I had not tested it before. It was just an intuition. However, I’ve just tested it on my data set (selecting about 100 rows from 10000), and the XML parsing / XPath processing took 99% of the time. In short, the 100 index seeks were still there (as in the case of the WHERE IN), but the XML manipulation was the most time consuming part.

Jan Zich 2009-10-06 07:59:31

fair enough. my guess is that also conversion of comma-separated values into a table will take the same 99% of the time. especially because index seeks are fast, especially on PKs (clustered).

van 2009-10-06 11:39:35

Answer 9

A:

select t.*
from (
    select id = 35 union all
    select id = 87 union all
    select id = 445 union all
    ...
    select id = 33643
) ids
join my_table t on t.id = ids.id

If the set of ids to search on is small, this may improve performance by permitting the query engine to do an index seek. If the optimizer judges that a table scan would be faster than, say, one hundred index seeks, then the optimizer will so instruct the query engine.

Note that query engines tend to treat

select t.*
from my_table t
where t.id in (35, 87, 445, ..., 33643)

as equivalent to

select t.*
from my_table t
where t.id = 35 or t.id = 87 or t.id = 445 or ... or t.id = 33643

and note that query engines tend not to be able to perform index seeks on queries with disjunctive search criteria. As an example, Google AppEngine datastore will not execute a query with a disjunctive search criteria at all, because it will only execute queries for which it knows how to perform an index seek.

Justice 2009-10-05 20:48:01

I'm not sure, but I think this may be a bit more performant if you used union all instead of a simple union. I think that when you use a straight union, it has to test to see if this is a duplicate first.

wcm 2009-10-05 20:58:09

Thanks - fixed.

Justice 2009-10-05 21:18:49

Answer 10

+2 A:

For me the IN (...) is not the preferred option due to many reasons, including the limitation on the number of parameters.

Following up on a note from Jan Zich regarding the performance using various temp-table implementations, here are some numbers from SQL execution plan:

XML solution: 99% time - xml parsing
comma-separated procedure using UDF from CodeProject: 50% temp table scan, 50% index seek. One can agrue if this is the most optimal implementation of string parsing, but I did not want to create one myself (I will happily test another one).
CLR UDF to split string: 98% - index seek.

Here is the code for CLR UDF:

public class SplitString
{
    [SqlFunction(FillRowMethodName = "FillRow")]
    public static IEnumerable InitMethod(String inputString)
    {
        return inputString.Split(',');
    }

    public static void FillRow(Object obj, out int ID)
    {
        string strID = (string)obj;
        ID = Int32.Parse(strID);
    }
}

So I will have to agree with Jan that XML solution is not efficient. Therefore if comma-separated list is to be passed as a filter, simple CLR UDF seems be optimal in terms of performance.

I tested the search of 1K record in a table of 200K.

van 2009-10-06 13:05:08

+1: Thank you. This is very interesting. I have been aware of the ability to run .NET in SQL, but I’ve never used it, so this did not come to my mind.

Jan Zich 2009-10-06 13:11:50

Answer 11

+1 A:

Once upon a long time ago, I found that on the particular DBMS I was working with, the IN list was more efficient up to some threshold (which was, IIRC, something like 30-70), and after that, it was more efficient to use a temp table to hold the list of values and join with the temp table. (The DBMS made creating temp tables very easy, but even with the overhead of creating and populating the temp table, the queries ran faster overall.) This was with up-to-date statistics on the main data tables (but it also helped to update the statistics for the temp table too).

There is likely to be a similar effect in modern DBMS; the threshold level may well have changed (I am talking about depressingly close to twenty years ago), but you need to do your measurements and consider your strategy or strategies. Note that optimizers have improved since then - they may be able to make sensible use of bigger IN lists, or automatically convert an IN list into an anonymous temp table. But measurement will be key.

Jonathan Leffler 2009-10-06 21:11:30

ansaurus

tags:

views:

answers:

Is WHERE ID IN (1, 2, 3, 4, 5, ...) the most efficient?

related questions