I know that this topic has been beaten to death, but it seems that many articles on the Internet are often looking for the most elegant way instead of the most efficient way to solve it. Here is the problem. We are building an application where one of the common database queries will involve manipulation (SELECT's and UPDATE's) based on a user-supplied list of ID's. The table in question is expected to have hundreds of thousands of rows, and the user-provided lists of ID's can potentially be unbounded, but they will most likely contain tens or hundreds of items (we may limit them for performance reasons later).

If my understanding of how databases work in general is correct, the most efficient approach is to simply use the WHERE ID IN (1, 2, 3, 4, 5, ...) construct and build the queries dynamically. The core of the problem is that the input lists of ID's will be really arbitrary, so no matter how clever the database is or how cleverly we implement it, we always have a random subset of integers to start with, and eventually every approach has to internally boil down to something like WHERE ID IN (1, 2, 3, 4, 5, ...) anyway.

One can find many approaches all over the web. For instance, one involves declaring a table variable, passing the list of ID's to a stored procedure as a comma-delimited string, splitting it in the stored procedure, inserting the ID's into the table variable and joining the master table on it, i.e. something like this:

-- 1. Table variable for the ID's:
DECLARE @IDS TABLE (ID int);

-- 2. Split the given string of ID's and insert each ID into @IDS.
-- Omitted for brevity.

-- 3. Join the main table to @IDS:
SELECT MyTable.ID, MyTable.SomeColumn
FROM MyTable INNER JOIN @IDS AS IDS ON MyTable.ID = IDS.ID;
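
For completeness, a minimal sketch of what the splitting in step 2 might look like (purely illustrative; it assumes the ID's arrive in a variable called @List with no spaces around the commas):

-- Peel off one ID at a time and insert it into @IDS.
DECLARE @List NVARCHAR(MAX);
DECLARE @Pos int;
SET @List = N'1,2,3,4,5';

WHILE LEN(@List) > 0
BEGIN
    SET @Pos = CHARINDEX(',', @List);
    IF @Pos = 0
    BEGIN
        INSERT INTO @IDS (ID) VALUES (CAST(@List AS int));
        SET @List = N'';
    END
    ELSE
    BEGIN
        INSERT INTO @IDS (ID) VALUES (CAST(LEFT(@List, @Pos - 1) AS int));
        SET @List = SUBSTRING(@List, @Pos + 1, LEN(@List));
    END
END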

Putting the problems with string manipulation aside, I think what essentially happens in this case is that in the third step SQL Server says: "Thank you, that's nice, but I just need a list of the ID's", and it scans the table variable @IDS, and then does n seeks in MyTable, where n is the number of ID's. I've done some elementary performance evaluations and inspected the query plan, and it seems that this is what happens. So the table variable, the string concatenation and splitting, and all the extra INSERT's are for nothing.

Am I correct? Or am I missing anything? Is there really some clever and more efficient way? Basically, what I'm saying is that SQL Server has to do n index seeks no matter what, and formulating the query as WHERE ID IN (1, 2, 3, 4, 5, ...) is the most straightforward way to ask for it.

+6  A: 

Well, it depends on what's really going on. How is the user choosing these IDs?

Also, it's not just efficiency; there's also security and correctness to worry about. When and how does the user tell the database about their ID choices? How do you incorporate them into the query?

It might be much better to put the selected IDs into a separate table that you can join against (or use a WHERE EXISTS against).

I'll give you that you're not likely to do much better performance-wise than IN (1,2,3..n) for a small (user-generated) n. But you need to think about how you generate that query. Are you going to use dynamic SQL? If so, how will you secure it from injection? Will the server be able to cache the execution plan?
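
One hedged sketch of what that could look like, assuming the IN list is built server-side: the list of placeholders is generated dynamically, but the values themselves go through sp_executesql parameters, so they are never concatenated into the string and the statement can be plan-cached (the parameter names below are illustrative):

-- Illustrative only: three IDs passed as typed parameters, not as concatenated text.
DECLARE @sql NVARCHAR(MAX);
SET @sql = N'SELECT ID, SomeColumn FROM MyTable WHERE ID IN (@p0, @p1, @p2);';

EXEC sp_executesql
    @sql,
    N'@p0 int, @p1 int, @p2 int',
    @p0 = 35, @p1 = 87, @p2 = 445;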

Also, using an extra table is often just easier. Say you're building a shopping cart for an eCommerce site. Rather than worrying about keeping track of the cart client-side or in a session, it's likely better to update the ShoppingCart table every time the user makes a selection. This also avoids the whole problem of how to safely set the parameter value for your query, because you're only making one change at a time.

Don't forget the old adage (with apologies to Benjamin Franklin):

He who would trade correctness for performance deserves neither

Joel Coehoorn
We are building a UI in which it will be very simple to select a really arbitrary set. But I'm guessing that in many cases the actual set will actually be several intervals anyway, so we could potentially build the queries as: (1 <= ID AND ID <= 8) OR (38 <= ID AND ID <= 89) ... It's worth considering this approach. Thanks.
Jan Zich
+6  A: 

Be careful; on many databases, IN (...) is limited to a fixed number of things in the IN clause. For example, I think it's 1000 in Oracle. That's big, but possibly worth knowing.

Dean J
Thanks for pointing me to this. I’ll definitely investigate it.
Jan Zich
I just tested it with about 20000 items in the IN clause, and it worked. However, when I tested it with 100000 items, SQL Server was thinking for a long time and then said: “The query processor ran out of internal resources and could not produce a query plan. This is a rare event and only expected for extremely complex queries or queries that reference a very large number of tables or partitions. Please simplify the query.” So there is no explicit limit per se, but the query processor chokes on it.
Jan Zich
+2  A: 

A table variable has issues: using a temp table with an index has benefits for statistics.

A table var is assumed to always have one row, whereas a temp table has stats the optimiser can use.

Parsing a CSV is easy: see questions on right...
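
A rough sketch of the contrast (illustrative only; the table and column names come from the question):

-- Temp table: can have a primary key and real statistics for the optimiser.
CREATE TABLE #IDS (ID int PRIMARY KEY);

-- Table variable: no statistics; the optimiser assumes roughly one row.
DECLARE @IDS TABLE (ID int PRIMARY KEY);

-- Once #IDS is populated, the optimiser can size the join from its actual row count:
SELECT t.ID, t.SomeColumn
FROM MyTable AS t
INNER JOIN #IDS AS IDS ON t.ID = IDS.ID;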

gbn
to parse a CSV string: http://stackoverflow.com/questions/1456192/comparing-a-column-to-a-list-of-values-in-t-sql/1456404#1456404
KM
A: 

For many years I used the third approach, but when I started using an OR/M it seemed to be unnecessary.

Even loading each row by ID is not as inefficient as it looks.

dario-g
A: 

Putting the problems with string manipulation aside, I think that:

WHERE ID=1 OR ID=2 OR ID=3 ...

is more efficient; nevertheless, I wouldn't do it.

You could compare performance between both approaches.

Juparave
No. IN beats OR, because the DB can optimize for that set (basically treat it like a small table).
Joel Coehoorn
Why do you think that's more efficient than an IN clause?
Eric J.
After some tests on a 20k-row table, 'EXPLAIN SELECT * from product where id in (1,2,3,4,5);' is exactly the same as 'EXPLAIN SELECT * from product where id=1 or id=2 or id=3 or id=4 or id=5;', so I guess I was wrong.
Juparave
+1  A: 

Essentially, I would agree with your observation; SQL Server's optimizer will ultimately pick the best plan for analyzing a list of values, and it will typically arrive at the same plan regardless of whether you are using

WHERE IN

or

WHERE EXISTS

or

JOIN someholdingtable ON ...

Obviously, there are other factors which influence plan choice (like covering indexes, etc.). The reason that people have various methods for passing this list of values to a stored procedure is that before SQL 2008, there really was no simple way of passing in multiple values. You could use a list of parameters (WHERE IN (@param1, @param2)...), or you could parse a string (the method you show above). As of SQL 2008, you can also pass table-valued parameters around, but the overall result is the same.

So yes, it doesn't matter how you get the list of values to the query; however, there are other factors which may have some effect on the performance of said query once you get the list of values in there.
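
For reference, a rough sketch of the SQL 2008 table-valued parameter variant (the type and procedure names here are made up purely for illustration):

CREATE TYPE dbo.IdList AS TABLE (ID int PRIMARY KEY);
GO

CREATE PROCEDURE dbo.GetRowsByIds
    @IDs dbo.IdList READONLY
AS
BEGIN
    SELECT t.ID, t.SomeColumn
    FROM MyTable AS t
    INNER JOIN @IDs AS ids ON t.ID = ids.ID;
END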

Stuart Ainsworth
+3  A: 

The IN clause does not guarantee an INDEX SEEK. I faced this problem before when using SQL Server Mobile Edition on a Pocket PC with very little memory. Replacing IN (list) with a list of OR clauses boosted my query by approximately 400%.

Another approach is to have a temp table that stores the ID's and join it against the target table, but if this operation is used very often, a permanent/indexed table can help the optimizer.

Rodrigo
A: 

To answer the question directly: there is no way to pass a (dynamic) list of arguments to an SQL Server 2005 procedure. Therefore, what most people do in these cases is pass a comma-delimited list of identifiers, which is what I did as well.

Since SQL 2005, though, I prefer passing an XML string, which is also very easy to create on the client side (C#, Python, another SQL SP) and which is "native" to work with since 2005:

CREATE PROCEDURE myProc(@MyXmlAsSTR NVARCHAR(MAX)) AS BEGIN
    DECLARE @x XML
    SELECT @x = CONVERT(XML, @MyXmlAsSTR)

Then you can join your base query directly with the XML select as (not tested):

SELECT      t.*
FROM        myTable t
INNER JOIN  @x.nodes('/ROOT/ROW') AS R(x)
        ON  t.ID = x.value('@ID', 'INTEGER')

when passing <ROOT><ROW ID="1"/><ROW ID="2"/></ROOT>. Just remember that XML is case-sensitive.

van
Really not sure this is a very efficient solution. Interesting maybe, but not efficient
Scott Ferguson
I agree. It does not look as efficient, but it looks quite elegant. You don't get your hands dirty with string manipulation (at least not directly).
Jan Zich
What do you mean by "does not look as efficient"? Have you tested it? Given the native support for the XML datatype since SQL 2005 (much improved since SQL 2000, btw), I seriously doubt that the string-splitting functions most people use are really "more" efficient. But tastes differ, I guess. As do **prejudices** (some think **slow** as soon as they see **xml**).
van
Sorry, I had not tested it before. It was just an intuition. However, I’ve just tested it on my data set (selecting about 100 rows from 10000), and the XML parsing / XPath processing took 99% of the time. In short, the 100 index seeks were still there (as in the case of the WHERE IN), but the XML manipulation was the most time consuming part.
Jan Zich
Fair enough. My guess is that conversion of comma-separated values into a table will also take the same 99% of the time, especially because index seeks are fast, especially on PKs (clustered).
van
A: 
select t.*
from (
    select id = 35 union all
    select id = 87 union all
    select id = 445 union all
    ...
    select id = 33643
) ids
join my_table t on t.id = ids.id

If the set of ids to search on is small, this may improve performance by permitting the query engine to do an index seek. If the optimizer judges that a table scan would be faster than, say, one hundred index seeks, then the optimizer will so instruct the query engine.

Note that query engines tend to treat

select t.*
from my_table t
where t.id in (35, 87, 445, ..., 33643)

as equivalent to

select t.*
from my_table t
where t.id = 35 or t.id = 87 or t.id = 445 or ... or t.id = 33643

and note that query engines tend not to be able to perform index seeks on queries with disjunctive search criteria. As an example, the Google App Engine datastore will not execute a query with disjunctive search criteria at all, because it will only execute queries for which it knows how to perform an index seek.

Justice
I'm not sure, but I think this may be a bit more performant if you used union all instead of a simple union. I think that when you use a straight union, it has to test to see if this is a duplicate first.
wcm
Thanks - fixed.
Justice
+2  A: 

For me, IN (...) is not the preferred option for many reasons, including the limitation on the number of parameters.

Following up on a note from Jan Zich regarding the performance of various temp-table implementations, here are some numbers from the SQL execution plans:

  • XML solution: 99% time - xml parsing
  • comma-separated procedure using a UDF from CodeProject: 50% temp table scan, 50% index seek. One can argue whether this is the most optimal implementation of string parsing, but I did not want to create one myself (I will happily test another one).
  • CLR UDF to split string: 98% - index seek.

Here is the code for the CLR UDF:

using System;
using System.Collections;
using Microsoft.SqlServer.Server;

public class SplitString
{
    // Streaming table-valued CLR function: one output row per comma-separated item.
    [SqlFunction(FillRowMethodName = "FillRow")]
    public static IEnumerable InitMethod(String inputString)
    {
        return inputString.Split(',');
    }

    // Converts each item produced above into the ID column of the result row.
    public static void FillRow(Object obj, out int ID)
    {
        string strID = (string)obj;
        ID = Int32.Parse(strID);
    }
}
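
Assuming the assembly is deployed and the function is registered as a table-valued function named dbo.SplitString returning a single ID column (the name is illustrative), the usage is roughly:

SELECT t.ID, t.SomeColumn
FROM MyTable AS t
INNER JOIN dbo.SplitString(N'35,87,445,33643') AS ids
        ON t.ID = ids.ID;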

So I will have to agree with Jan that the XML solution is not efficient. Therefore, if a comma-separated list is to be passed as a filter, a simple CLR UDF seems to be optimal in terms of performance.

I tested the search of 1K records in a table of 200K rows.

van
+1: Thank you. This is very interesting. I have been aware of the ability to run .NET in SQL, but I’ve never used it, so this did not come to my mind.
Jan Zich
+1  A: 

Once upon a long time ago, I found that on the particular DBMS I was working with, the IN list was more efficient up to some threshold (which was, IIRC, something like 30-70), and after that, it was more efficient to use a temp table to hold the list of values and join with the temp table. (The DBMS made creating temp tables very easy, but even with the overhead of creating and populating the temp table, the queries ran faster overall.) This was with up-to-date statistics on the main data tables (but it also helped to update the statistics for the temp table too).

There is likely to be a similar effect in modern DBMS; the threshold level may well have changed (I am talking about depressingly close to twenty years ago), but you need to do your measurements and consider your strategy or strategies. Note that optimizers have improved since then - they may be able to make sensible use of bigger IN lists, or automatically convert an IN list into an anonymous temp table. But measurement will be key.

Jonathan Leffler