
I need to select a large number of rows from an even larger table that is keyed on an autoincrement identity column. I have the primary key values for each of the rows I'm trying to select, but the list of keys can be very large.

Often, but not always, the rows being selected are contiguous, so I implemented a mechanism that converts the select into a set of range clauses for all of the entries that are clumped together ([PrimaryKey] BETWEEN 151 AND 217), with a fallback that selects all of the isolated entries with an IN clause.
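
(For illustration only: my implementation does this clumping client-side, but if the keys were loaded into a temp table #Keys, the same grouping could be expressed in T-SQL as a classic gaps-and-islands query.)

-- Sketch only: each run of contiguous IDs shares a constant ID - rn,
-- so grouping on that difference yields one row per contiguous range.
;WITH Numbered AS
  (SELECT ID, ROW_NUMBER() OVER (ORDER BY ID) AS rn FROM #Keys)
SELECT MIN(ID) AS RangeStart, MAX(ID) AS RangeEnd
FROM Numbered
GROUP BY ID - rn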

In the end I get something like this:

SELECT * FROM Table WHERE ([PrimaryKey] BETWEEN 151 AND 217) OR ([PrimaryKey] BETWEEN 314 AND 378) OR ...
OR [PrimaryKey] IN (1,3,7,14,147...)

This works great for the cases where I have mostly large ranges, but it breaks down as the query gets longer. I just ran across a degenerate case with a large number of "pairs" of entries, which generated a BETWEEN clause for every two entries; that query spent more than 15 minutes trying to produce an execution plan before I gave up on it.

The first thing that comes to mind is to raise the threshold at which I start generating ranges instead of individual values from 2 to something higher (10, perhaps?), but I was wondering if there is a better solution out there.

+7  A: 

Create a temp table with the values you want to select and join from your main table to the temp table. That way there is practically no limit.
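
A minimal sketch of the idea (#SelectedKeys, MyTable, and the [PrimaryKey] column are placeholder names):

-- Load the wanted keys into a temp table, then join to it.
CREATE TABLE #SelectedKeys (ID int PRIMARY KEY)

INSERT INTO #SelectedKeys (ID) VALUES (1)
INSERT INTO #SelectedKeys (ID) VALUES (3)
-- ... bulk-load the rest of the list in practice ...

SELECT t.*
FROM MyTable t
INNER JOIN #SelectedKeys k ON k.ID = t.[PrimaryKey]

DROP TABLE #SelectedKeys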

Otávio Décio
could also use a table variable here too
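
For example (sketch only, same placeholder names):

DECLARE @SelectedKeys TABLE (ID int PRIMARY KEY)
INSERT INTO @SelectedKeys (ID) VALUES (1)
-- ... load the remaining keys ...
SELECT t.* FROM MyTable t INNER JOIN @SelectedKeys k ON k.ID = t.[PrimaryKey]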
Russ Cam
+1  A: 

As user ocdecio correctly proposes, you could join to a temporary table containing all the IDs. You could also try splitting up the ORs into separate UNION parts:

SELECT * FROM Table WHERE [PrimaryKey] BETWEEN 151 AND 217
UNION
SELECT * FROM Table WHERE [PrimaryKey] BETWEEN 314 AND 378
UNION
...
UNION
SELECT * FROM Table WHERE [PrimaryKey] IN (1,3,7,14,147...)
splattne
+1  A: 

I can understand the intention of your mechanism. In practice, however, a good index and faith in the optimiser are usually better.

If you were to create a dozen conditions in the WHERE clause, the query engine needs to check each one of them until it finds a match.

Equally, creating multiple queries and unioning them would mean index scans or index lookups on the table multiple times.

It is true, however, that an IN clause can get very slow when you have a large list. In that case using a join is usually faster. From experience, this is always my preferred option.

Yet it is true that BETWEEN is particularly efficient for larger ranges. With that in mind, it may be beneficial to use the UNION mechanism where the first recordset uses a JOIN and the rest use BETWEEN, provided each BETWEEN covers a significant range.
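
A sketch of that shape (#IsolatedKeys and MyTable are placeholders; the isolated keys are assumed to be loaded already):

SELECT t.*
FROM MyTable t
INNER JOIN #IsolatedKeys k ON k.ID = t.[PrimaryKey]
UNION ALL -- the ranges never overlap the isolated keys, so ALL avoids a distinct sort
SELECT * FROM MyTable WHERE [PrimaryKey] BETWEEN 151 AND 217
UNION ALL
SELECT * FROM MyTable WHERE [PrimaryKey] BETWEEN 314 AND 378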

A significant consideration is the time it takes to prepare such a query. If SQL Server has to generate a dynamic query using T-SQL, there are two overheads: the time to generate the query, and the time to parse it and build the execution plan. The first would dominate for large lists, and may cost more time than is saved by using BETWEEN.

If the client generates the dynamic query, you could argue that you are transferring some load from the server to the client, though I am still sceptical that the benefits would be significant.

As such, unless I saw very noticeable performance increases, I would stick with the join. The principal reasons are ones of engineering:
- Time to develop
- Reliability of code (simple is always more reliable than clever tricks)
- Maintainability of code (will subsequent maintainers understand the trick?)

If you do test various combinations of JOIN and BETWEEN, with or without UNION, etc, I'd be very interested to see your performance results.

Dems
A: 

This is similar to a problem I recently faced: I needed to select from a table the rows whose primary keys appeared in a large list. The question was how to send that large list of keys to the SQL server in a form that is fast and efficient to use.

What works really well for this type of situation is the XML data type. If you create a stored procedure that takes one parameter of type XML, you can pre-format the input into an XML fragment. As an example, let's say the XML fragment looks like this:

<a>
  <b>1</b>
  <b>3</b>
  <b>7</b>
  <b>14</b>
  <b>147</b>
</a>

I gave the elements short names ("a" and "b") because a longer name would mean more bytes to transmit from the client to the SQL server. Here is how you would select all the contents of the "b" elements as a record set:

declare @x xml
set @x = '<a><b>1</b><b>3</b><b>7</b><b>14</b><b>147</b></a>'
select t.item.value('.', 'int') from @x.nodes('//a/b') as t(item)

Although the syntax is cryptic, the XML type can be queried just like a table. Now you should see where this is going. If we can query the XML type like a table, we can intersect that with another table:

select * from MyTable where ID in 
  (select t.item.value('.', 'int') from @x.nodes('//a/b') as t(item))

or using a join

;with cte as
  (select ID = t.item.value('.', 'int') from @x.nodes('//a/b') as t(item)) 
select * from MyTable inner join cte on MyTable.ID = cte.ID

You need to run both versions to see which will be faster for your data. I find the JOIN works faster with my data. Here is a stored procedure that takes the XML type as input and spits back our selected rows:

create procedure MyProc @x xml as
begin
  set nocount on
  ;with cte as
    (select ID = t.item.value('.', 'int') from @x.nodes('//a/b') as t(item)) 
  select * from MyTable inner join cte on MyTable.ID = cte.ID
end

A sample call of the new stored procedure:

exec MyProc '<a><b>1</b><b>3</b><b>7</b><b>14</b><b>147</b></a>'

I also found that adding an XML schema for the input fragment helped to speed up the stored procedure slightly. I won't go into the details of XML schemas here, but the idea is to tell SQL Server beforehand what the XML fragment will look like. Here's how we would create our schema:

create xml schema collection MyInputSchema as
  '<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <xsd:element name="a">
      <xsd:complexType>
        <xsd:complexContent>
          <xsd:restriction base="xsd:anyType">
            <xsd:sequence>
              <xsd:element name="b" type="xsd:integer" maxOccurs="unbounded" />
            </xsd:sequence>
          </xsd:restriction>
        </xsd:complexContent>
      </xsd:complexType>
    </xsd:element>
  </xsd:schema>'

Now we can associate this schema with the input to our stored procedure like this:

create procedure MyProc @x xml(MyInputSchema) as
begin
  set nocount on
  ;with cte as
    (select ID = t.item.value('.', 'int') from @x.nodes('//a/b') as t(item)) 
  select * from MyTable inner join cte on MyTable.ID = cte.ID
end

With all this in place, I was able to send an XML fragment of 43,016 characters from my client machine to the SQL server and get back a result set very quickly. I ran a test of 1,000 requests on each of 10 threads, for a total of 10,000 requests, and measured 72 requests processed per second. Of course your mileage will vary depending on your hardware and software.

NOTE: This code works on SQL 2005 and should also work on 2008.

Charles
A: 

How do you determine the list of primary key values to select on? I am wondering if we should look at this further 'upstream': is there some earlier query or queries you run to arrive at the list of keys? If so, perhaps you could join from that and skip the key lookup altogether.
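
For example, with a hypothetical upstream table:

-- Join straight from whatever produced the keys instead of materialising the list.
SELECT t.*
FROM MyTable t
INNER JOIN UpstreamTable u ON u.KeyID = t.[PrimaryKey]
WHERE u.SomeCondition = 'met'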

n8wrl
A: 

My first thought would be to look at how you are pulling in the list of keys. Is it from a query elsewhere? If so, have you tried

SELECT * FROM Table WHERE [PrimaryKey] in
(
    SELECT PrimaryKey from SomeOtherTable where Condition = 'met'
)

or if your conditions are in the same table, it's even simpler

SELECT * FROM Table WHERE condition = 'met'

Of course, if your keys are all coming from another source (your app, maybe), then I personally would stick with one of the previously suggested methods, or maybe bite the bullet and use one big IN clause.

BTW - sorry if I'm teaching anyone to suck eggs, it's just that my experience has been that sometimes the simplest solutions get overlooked.

ZombieSheep