I have an array with a huge number of IDs that I would like to select from the DB.

The usual approach would be to do SELECT blabla FROM xxx WHERE yyy IN (ids) OPTION (RECOMPILE). (The OPTION (RECOMPILE) is needed because SQL Server is not intelligent enough to see that storing this query in its plan cache is a huge waste of memory.)

However, SQL Server is horrible at this type of query when the number of IDs is high; the parser it uses is simply too slow. Let me give an example:

SELECT * FROM table WHERE id IN (288525, 288528, 288529,<about 5000 ids>, 403043, 403044) OPTION (RECOMPILE)

Time to execute: ~1100 msec (this returns approx. 200 rows in my example)

Versus:

SELECT * FROM table WHERE id BETWEEN 288525 AND 403044 OPTION (RECOMPILE)

Time to execute: ~80 msec (this returns approx. 50000 rows in my example)

So even though I get 250 times more data back, it executes 14 times faster...

So I built this function to take my list of IDs and build something that is a reasonable compromise between the two: something that doesn't return 250 times as much data, yet still gets the benefit of the faster parse.

  // Keys at most this far apart are merged into one BETWEEN range, at the
  // cost of fetching up to this many unwanted rows per merged gap.
  private const int MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH = 5;
  // Builds a WHERE fragment covering as many keys as fit in ~63000 characters,
  // advancing startindex past the keys consumed. Assumes keys is sorted and distinct.
  public static string MassIdSelectionStringBuilder(
       List<int> keys, ref int startindex, string colname)
  {
     const int maxlength = 63000;
     if (keys.Count - startindex == 1)
     {
        string idstring = String.Format("{0} = {1}", colname, keys[startindex]);
        startindex++;
        return idstring;
     }
     StringBuilder sb = new StringBuilder(maxlength + 1000);
     List<int> individualkeys = new List<int>(256);
     int min = keys[startindex++]; // [min, max] tracks the current run of nearby keys
     int max = min;
     sb.Append("(");
     const string betweenAnd = "{0} BETWEEN {1} AND {2}\n";
     // Stop once the clause, plus ~8 chars for each pending IN-list id, would exceed maxlength.
     for (; startindex < keys.Count && sb.Length + individualkeys.Count * 8 < maxlength; startindex++)
     {
        int key = keys[startindex];
        // Gap too large to bridge: flush the current run as a range or an individual id.
        if (key > max + MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH)
        {
           if (min == max)
              individualkeys.Add(min);
           else
           {
              if(sb.Length > 2)
                 sb.Append(" OR ");
              sb.AppendFormat(betweenAnd, colname, min, max);
           }
           min = max = key;
        }
        else
        {
           max = key;
        }
     }
     // Flush the final run.
     if (min == max)
        individualkeys.Add(min);
     else
     {
        if (sb.Length > 2)
           sb.Append(" OR ");
        sb.AppendFormat(betweenAnd, colname, min, max);
     }
     if (individualkeys.Count > 0)
     {
        if (sb.Length > 2)
           sb.Append(" OR ");
        string[] individualkeysstr = new string[individualkeys.Count];
        for (int i = 0; i < individualkeys.Count; i++)
           individualkeysstr[i] = individualkeys[i].ToString();
        sb.AppendFormat("{0} IN ({1})", colname,  String.Join(",",individualkeysstr));
     }
     sb.Append(")");
     return sb.ToString();
  }

It is then used like this:

 List<int> keys; //Sort and make unique
 ...
 for (int i = 0; i < keys.Count;)
 {
    string idstring = MassIdSelectionStringBuilder(keys, ref i, "id");
    string sqlstring = string.Format("SELECT * FROM table WHERE {0} OPTION (RECOMPILE)", idstring);
    // ...execute sqlstring and accumulate the results...
 }
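
For illustration, tracing the builder over a small made-up key list: with keys = {288525, 288526, 288529, 300000} and colname = "id", a single call consumes all four keys and returns a fragment like

 (id BETWEEN 288525 AND 288529
  OR id IN (300000))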

However, my question is... Does anyone know of a better/faster/smarter way to do this?

A: 

Adding RECOMPILE is not a good idea. Precompiling means SQL does not save your query results, but it does save the execution plan, thereby trying to make the query faster. If you add RECOMPILE, you always have the overhead of compiling the query. Try creating a stored procedure that holds the query and call it from there, as stored procedures are always precompiled.

zapping
You miss the point. The RECOMPILE is used to FORCE the query cache NOT to store the plan. A query like this easily takes up ½ megabyte of memory, and a separate cache entry is created for every distinct length of the IN list. So if you do 6 selections, first with 300 ids, then 301, 302, 500, 600, and 800, you may have used perhaps 6 MB of SQL Server memory for nothing... Actually, building the query plan for this super trivial query takes SQL Server almost zero time, since there is usually only a single index to consider and there are no joins.
Cine
http://blogs.msdn.com/queryoptteam/archive/2006/03/31/565991.aspx
zapping
Are you sure that RECOMPILE makes the query not cache, or the query plan not cache? And how do you get the ids? Are they always constant, or do they change?
zapping
Yes. Try it yourself:

DBCC FREEPROCCACHE;
SELECT * FROM table WHERE id IN (1);
SELECT * FROM table WHERE id IN (1,2);
SELECT * FROM table WHERE id IN (1,2,3) OPTION (RECOMPILE);
SELECT sqlt.TEXT AS 'Cached query', qstat.execution_count, cplan.size_in_bytes,
       qstat.last_worker_time, qstat.max_worker_time, qstat.last_execution_time
FROM sys.dm_exec_cached_plans cplan
INNER JOIN sys.dm_exec_query_stats qstat ON cplan.plan_handle = qstat.plan_handle
CROSS APPLY sys.dm_exec_sql_text(qstat.sql_handle) sqlt
OPTION (RECOMPILE);

The last one with 3 ids won't be in the query cache.
Cine
+1  A: 

If the list of IDs were in another table that was indexed, this would execute a whole lot faster using a simple INNER JOIN.

If that isn't possible, then try creating a TABLE variable like so:

DECLARE @tTable TABLE
(
   Id int
)

Store the ids in the table variable first, then INNER JOIN it to your table xxx. I have had limited success with this method, but it's worth a try.
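
For illustration, a minimal sketch of that flow, reusing the question's placeholder table xxx and column yyy (a sketch under those assumed names, not Neil's exact code):

INSERT INTO @tTable (Id) VALUES (288525)
INSERT INTO @tTable (Id) VALUES (288528)
-- ...one INSERT per remaining id...

SELECT x.*
FROM xxx x
INNER JOIN @tTable t ON t.Id = x.yyy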

Neil
@tTable idea: That would work... if it weren't for the fact that you need 5000 INSERTs to get the ids into the table variable in the first place. The result is EVEN slower than anything else.
Cine
INNER JOIN idea: Yes, this is also useful. However, it is only valid when either 1. you actually have the ids in a table already, or 2. the cost of selecting those IDs is low, or you don't actually need the ids for anything else.
Cine
A: 

Another dirty idea, similar to Neil's:

  • Have an indexed view which holds the IDs alone, based on your business condition
  • Then join the view with your actual table to get the desired result.
Ramesh Vel
If there were a constant number of IDs, the situation would be a lot simpler. It is possible to make an indexed view defined as (SELECT 12 UNION ALL SELECT 13 etc.), but 1. that requires taking a schema modification lock (which would block ALL other threads in your DB), 2. you would need to build the index on the view, and 3. the creation query of the view would be even longer than the IN query, so the parser would take even longer.
Cine
+1  A: 

You're using (key > max+MAX_NUMBER_OF_EXTRA_OBJECTS_TO_FETCH) as the check to determine whether to do a range fetch instead of an individual fetch. It appears that's not the best way to do that.

Let's consider the 4 ID sequences {2,7}, {2,8}, {1,2,7}, and {1,2,8}. They translate into

ID BETWEEN 2 AND 7
ID IN (2, 8)
ID BETWEEN 1 AND 7
ID BETWEEN 1 AND 2 OR ID IN (8)

The decision to fetch and filter the IDs 3-6 now depends only on the difference between 2 and 7/8. However, it does not take into account whether 2 is already part of a range or an individual ID.

I think the proper criterion is how many individual IDs you save. Converting two individual IDs into a range has a net benefit of 2 * Cost(individual) - Cost(range), whereas extending an existing range has a net benefit of Cost(individual) - Cost(range extension).

MSalters
I agree, it is not optimal. I don't agree with your cost calculation, though. Selecting a range of 5 costs almost the same as selecting 1, since the execution has to descend the B-tree index either way, and the overhead of doing that is rather high. The particular table I am working on has a tiny row size, so I could extend the range to 50 and the query execution time would stay about the same. But then I get rather many results back.
Cine
A: 

The efficient way to do this is to:

  1. Create a temporary table to hold the IDs
  2. Call a SQL stored procedure with a string parameter holding all the comma-separated IDs
  3. The SQL stored procedure uses a loop with CHARINDEX() to find each comma, then SUBSTRING to extract the string between two commas, CONVERT to make it an int, and INSERT INTO @Temporary VALUES ... to put it into the temporary table
  4. INNER JOIN the temporary table or use it in an IN (SELECT ID from @Temporary) subquery

Every one of these steps is extremely fast because a single string is passed, no compilation is done during the loop, and no substrings are created except the actual id values.

No recompilation is done at all when this is executed as long as the large string is passed as a parameter.

Note that in the loop you must track the prior and current comma positions in two separate variables.
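
For illustration, a minimal sketch of steps 2-4 under stated assumptions: the procedure name, the target table xxx, and its int column id are placeholders, and the id string is strictly comma-separated with no trailing comma.

CREATE PROCEDURE dbo.SelectByIdList
   @IdList varchar(max)  -- e.g. '288525,288528,288529'
AS
BEGIN
   DECLARE @Temporary TABLE (Id int PRIMARY KEY)
   DECLARE @prev int, @curr int
   SET @prev = 0                        -- position of the previous comma
   SET @curr = CHARINDEX(',', @IdList)  -- position of the current comma
   WHILE @curr > 0
   BEGIN
      -- the substring between the two commas is one id
      INSERT INTO @Temporary
      VALUES (CONVERT(int, SUBSTRING(@IdList, @prev + 1, @curr - @prev - 1)))
      SET @prev = @curr
      SET @curr = CHARINDEX(',', @IdList, @prev + 1)
   END
   -- the last id has no trailing comma
   INSERT INTO @Temporary
   VALUES (CONVERT(int, SUBSTRING(@IdList, @prev + 1, LEN(@IdList) - @prev)))

   SELECT x.*
   FROM xxx x
   INNER JOIN @Temporary t ON t.Id = x.id
END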

RobC
Unfortunately the time it takes to parse the string is very high. There are 3 basic ways to do it. 1. As you describe, with CHARINDEX etc.; this is extremely slow. 2. Use XML:

CREATE FUNCTION dbo.SplitIdXml(@idslist xml)
RETURNS TABLE
AS
RETURN (SELECT T.c.value('.', 'int') AS Id FROM @idslist.nodes('/b') AS T(c))
GO
SELECT * FROM dbo.SplitIdXml('<b>123</b><b>124</b>')

This has decent speed, but time trials show it is still ~3 times slower than the BETWEEN solution. 3. Use SQLCLR to parse the string; this ought to be fast, but there is some overhead in table-valued functions, so it ends up slow as well.
Cine
+1  A: 

In my experience the fastest way was to pack numbers in binary format into an image. I was sending up to 100K IDs, which works just fine:

Mimicking a table variable parameter with an image

Yet it was a while ago. The following articles by Erland Sommarskog are up to date:

Arrays and Lists in SQL Server
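
For illustration, a rough sketch of the server-side unpacking in that technique, under stated assumptions: a dbo.Numbers helper table holding the integers 1..N, and ids packed client-side as 4-byte big-endian integers (SQL Server reads binary as big-endian). Sommarskog's articles cover the real variants.

DECLARE @blob varbinary(max)  -- the packed ids, passed as a parameter

SELECT CONVERT(int, SUBSTRING(@blob, (n.Number - 1) * 4 + 1, 4)) AS Id
FROM dbo.Numbers n
WHERE n.Number <= DATALENGTH(@blob) / 4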

AlexKuznetsov
That last link is really useful!
Cine
A: 

Off the cuff here - does incorporating a derived table help performance at all? I am not set up to test this fully; I just wonder if this would optimize to use BETWEEN and then filter the unneeded rows out:

Select * from 
( SELECT *
  FROM dbo.table 
  WHERE ID between <lowerbound> and <upperbound>) as range
where ID in ( 
    1206,
    1207,
    1208,
    1209,
    1210,
    1211,
    1212,
    1213,
    1214,
    1215,
    1216,
    1217,
    1218,
    1219,
    1220,
    1221,
    1222,
    1223,
    1224,
    1225,
    1226,
    1227,
    1228,
    <...>,
    1230,
    1231
)
onupdatecascade
In your form, no. Any decent SQL query optimizer will notice that it can push the ID selection into the inner query and also eliminate the useless derived table. You could FORCE it to do what you want by adding a TOP 99999999 to the inner query, but the result is much slower.
Cine
Ah, well, it was a stab. I thought a "decent SQL query optimizer" might find that there are two conditions (IN and BETWEEN), note that the range scan for BETWEEN was less expensive, and process that first. Long shot, I guess.
onupdatecascade