views:

280

answers:

2

Alright SQL Server Gurus, fire up your analyzers.

  • I have a list of titles in application memory (250 or so).
  • I have a database table "books" with greater than a million records, one of the columns of the table is "title" and is of type nvarchar.
  • the "books" table has another column called "ISBN"
  • books.title is not a primary key, is not unique, but is indexed.

So I'd like to know which is more efficient:

WITH titles AS (select 'Catcher and the Rye' as Title
                union all 'Harry Potter ...'
                ...
                union all 'The World Is Flat')

select ISBN from books, titles where books.title = titles.title;

OR:

select ISBN from books where title in ('Catcher and the Rye','Harry Potter',...,'The World Is Flat');

OR:

???
+2  A: 

I hope you have ISBN includes on the title index to avoid key lookups

CREATE INDEX IX_Titles ON dbo.Books (Title) INCLUDE (ISBN)

Now, the IN vs JOIN vs EXISTs is a common question here. The CTE is irrelevant except for readability. Personally, I'd use exists because you'll get duplicates with JOIN for books with the same title, which folk often forget.

;WITH titles AS (select 'Catcher and the Rye' as Title
            union all 'Harry Potter ...'
            ...
            union all 'The World Is Flat')
SELECT
    ISBN 
FROM
    books
WHERE
    EXISTS (SELECT * --or null or = all the same
        FROM
            titles 
        WHERE
            titles .Title = books.Title)

However, one construct I'd consider is this to force "intermediate materialisation" on my list of search titles. The also applies to an exists or CTE solution too. This is likely to help the optimiser considerably.

Edit: a temp table is a better option, really, as Steve mentioned in his comment

SELECT
    ISBN 
FROM
    (
    SELECT TOP 2000000000
        Title
    FROM
        (select 'Catcher and the Rye' as Title
                union all 'Harry Potter ...'
                ...
                union all 'The World Is Flat'
        ) foo
    ORDER BY
       Title
    ) bar
    JOIN
    books On bar.Title = books.Title


SELECT
    ISBN 
FROM
    books
WHERE
    EXISTS (SELECT * --or null or = all the same
        FROM
            (
            SELECT TOP 2000000000
                Title
            FROM
                (select 'Catcher and the Rye' as Title
                        union all 'Harry Potter ...'
                        ...
                        union all 'The World Is Flat'
                ) foo
            ORDER BY
               Title
            ) bar
        WHERE
            bar.Title = books.Title)
gbn
Trust the optimizer. Only if you find out from real testing that it's not making a wise choice should you even begin to consider trying to coax particular behavior.In any case, this use of TOP/ORDER BY is a terrible way to do what you can much more readably do by inserting the sought titles into an indexed table variable or temporary table and joining on that.A future optimizer improvement might well ignore the TOP, in fact, because it has no semantic meaning when the table expression it applies to has fewer than 2000... rows.
Steve Kass
@Steve: Good points, I forgot about temp table/table vars. I'd use a temp table for statistics/cardinality/est. rows: not a table variable. The TOP/ORDER BY is something I don't use anymore (and not sure of performance) but it *does* allow a single SQL call to happen. I'd also hope it'd be tested against other methods.
gbn
A: 

Given the choice of the two options, avoid IN clauses, as the number of items within the list goes up the query plan will alter and very quickly convert from a potential Seek to a Scan.

The normal tipping point (and I double checked on the adventure works) is that on the 65th item, it changes plan to a scan from a seek.

Andrew