I have a situation where I have to build my SQL strings dynamically, and I'm trying to use parameters and sp_executesql where possible so I can reuse query plans. From lots of reading online and from personal experience, I have found NOT INs and INNER/LEFT JOINs to be slow and expensive when the base (left-most) table is large (1.5M rows with around 50 columns). I have also read that using any type of function should be avoided because it slows down queries, so I'm wondering which is worse?

I have used this workaround in the past, although I'm not sure it's the best thing to do, to avoid a NOT IN against a list of items when, for example, I'm passing in a pipe-delimited list of 3-character strings (delimiter only between elements):

-- true only when [col] does not occur anywhere in @param1 (NOT IN semantics)
LEN(@param1) = LEN(REPLACE(@param1, [col], ''))

instead of:

[col] NOT IN('ABD', 'RDF', 'TRM', 'HYP', 'UOE') 

...imagine the list of strings being anywhere from 1 to about 80 possible values long; this method doesn't lend itself to parameterization either.

In this example, "=" in the LEN comparison gives me NOT IN semantics; for an IN I would use a traditional list (or "!=" in the comparison, if that's somehow faster, though I doubt it). Is this approach faster than using NOT IN?

As a possible third alternative, what if I knew all the other possibilities (the IN possibilities, which could potentially be an 80-95x longer list) and passed those instead? This would be done in the application's business layer to take the workload off of SQL Server. Not very good for query plan reuse, but if it shaves a second or two off a big nasty query, why the hell not.

I'm also adept at SQL CLR function creation. Since the above is string manipulation, would a CLR function be best?

Thoughts?

Thanks in advance for any and all help/advice/etc.

+2  A: 

I have found "NOT IN"s and "INNER/LEFT JOIN"s to be slow performers and expensive when the base (left-most) table is large

It shouldn't be slow if you have indexed your table correctly. Something that can make a query slow is a correlated (dependent) subquery - one that must be re-evaluated for each row in the table because the subquery references values from the outer query.
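For illustration, here is the kind of correlated subquery I mean (the Orders table and its columns are hypothetical):

-- The inner query references o.CustomerID from the outer query, so it
-- conceptually runs once per outer row (unless the optimizer can rewrite it).
SELECT o.OrderID, o.Total
FROM Orders AS o
WHERE o.Total > (SELECT AVG(x.Total)
                 FROM Orders AS x
                 WHERE x.CustomerID = o.CustomerID)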

I also have read that using any type of function should be avoided as it slows down queries

It depends. SELECT function(x) FROM ... probably won't make a huge difference to performance. The problems arise when you apply a function to a column elsewhere in the query - in JOIN conditions, the WHERE clause, or ORDER BY - because it may mean an index cannot be used. A function of a constant value is not a problem, though.
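A quick sketch of the difference, assuming a hypothetical Orders table with an index on OrderDate:

-- Non-SARGable: the function wraps the column, so the index on OrderDate
-- cannot be seeked and the predicate is evaluated against every row.
SELECT OrderID FROM Orders WHERE YEAR(OrderDate) = 2009

-- SARGable rewrite: the column stands alone, enabling an index seek.
-- Functions applied to constants on the other side remain harmless.
SELECT OrderID FROM Orders
WHERE OrderDate >= '20090101' AND OrderDate < '20100101'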

Regarding your query, I'd try using [col] NOT IN ('ABD', 'RDF', 'TRM', 'HYP', 'UOE') first. If this is slow, make sure that you have indexed the table appropriately.

Mark Byers
Would NOT IN perform much better if I passed it the list of 80-90 items the user wanted to include instead of the 5 or so he/she wanted to filter out, and used IN instead?
...also, I could have sworn I read that when the NOT operator is used, an index cannot be used to evaluate that clause. I don't have the book in front of me, but it was in the MCTS certification training for SQL Server 2008 development.
@Mark, `SELECT function(x) FROM ...` is still a dog, even in the SELECT clause.
Peter
A: 

First off, since you are only filtering out a small percentage of the records, chances are the index on col isn't being used at all, so SARGability is moot.

So that leaves query plan reuse.

  • If you are on SQL Server 2008, replace @param1 with a table-valued parameter and have your application pass that instead of a delimited list. This solves your problem completely (rough sketch after this list).

  • If you are on SQL Server 2005, I don't think it matters. You could split the delimited list and use NOT IN/NOT EXISTS against the table, but what's the point if you won't get an index seek on col?
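A rough sketch of the 2008 approach (the type, procedure, and table names here are made up):

-- One-time setup: a table type to carry the list of 3-character codes.
CREATE TYPE dbo.CodeList AS TABLE (code CHAR(3) PRIMARY KEY)
GO

CREATE PROCEDURE dbo.GetRowsExcluding
    @exclude dbo.CodeList READONLY   -- TVPs must be declared READONLY
AS
BEGIN
    SELECT t.*
    FROM dbo.Table1 AS t
    WHERE NOT EXISTS (SELECT 1 FROM @exclude AS e WHERE e.code = t.col)
END
GO

-- Calling it from T-SQL; from ADO.NET you would pass a DataTable
-- as a SqlDbType.Structured parameter instead.
DECLARE @list dbo.CodeList
INSERT @list (code) VALUES ('ABD'), ('RDF'), ('TRM')
EXEC dbo.GetRowsExcluding @exclude = @list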

Can anyone speak to the last point? Would splitting the list into a table variable and then anti-joining against it save enough CPU cycles to offset the setup cost?

EDIT, third method for SQL Server 2005 using XML, inspired by OMG Ponies' link:

DECLARE @not_in_xml XML
SET @not_in_xml = N'<values><value>ABD</value><value>RDF</value></values>'

-- exist() returns 1 if any <value> node matches the row's col value;
-- comparing to 0 therefore keeps only rows whose col is not in the XML list
SELECT * FROM Table1
WHERE @not_in_xml.exist('/values/value[text()=sql:column("col")]') = 0

I have no idea how well this performs compared to a delimited list or TVP.

Peter
http://dotnethitman.spaces.live.com/Blog/cns!E149A8B1E1C25B14!222.entry
OMG Ponies
@OMG Ponies, do you know how this performs compared to (a) `CHARINDEX()` on a delimited list, or (b) a split function?
Peter
CHARINDEX guarantees a table scan; a derived table is more accommodating (JOIN, EXISTS, etc.), but it doesn't change the cost of deriving the data into the table. A CLR table-valued function could perform better; it just depends on the amount of data.
OMG Ponies
@OMG Ponies: He'll never get an index seek on `col`, given his use case. Not selective enough. Given that, how do these methods compare?
Peter
+1  A: 

As Donald Knuth is often (mis)quoted, "premature optimization is the root of all evil".
So, first of all: are you sure that your code, written in the clearest and simplest way (to both write and read), actually performs slowly? If you're not sure, measure it before reaching for any "clever" optimization tricks.

If the code is slow, examine the query plans thoroughly. Most of the time query execution takes much longer than query compilation, so you usually don't have to worry about query plan reuse. Building optimal indexes and/or table structures therefore usually gives significantly better results than tweaking how the query text is built.

For instance, I seriously doubt that your query with LEN and REPLACE performs better than NOT IN - in either case all the rows will be scanned and checked for a match. For a long enough list, the MSSQL optimizer will automatically create a temp table to optimize the equality comparison.
Even worse, tricks like this tend to introduce bugs: your example, say, would work incorrectly for [col] = 'AB'.
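To see that edge case concretely:

-- 'AB' is not in the list, yet REPLACE strips it out of 'ABD' anyway,
-- so the lengths differ and a row with [col] = 'AB' is wrongly excluded.
DECLARE @param1 VARCHAR(100)
SET @param1 = 'ABD|RDF|TRM'
SELECT LEN(@param1)                     AS original_len,  -- 11
       LEN(REPLACE(@param1, 'AB', ''))  AS replaced_len   -- 9: not equal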

IN queries are often faster than NOT IN, because an IN check can stop as soon as it finds one matching value. How well this works for you depends on whether you can build a correct IN list quickly enough.

Speaking of passing a variable-length list to the server, there are many discussions here on SO and elsewhere. Generally, your options are:

  • table-valued parameters (MSSQL 2008+ only),
  • dynamically constructed SQL (error prone and/or unsafe),
  • temp tables (good for long lists, probably too much overhead in writing and execution time for short ones),
  • delimited strings (good for short lists of 'well-behaved' values - like a handful of integers; see the splitter sketch after this list),
  • XML parameters (somewhat complex, but works well - if you use a good XML library and do not construct complex XML text 'by hand').
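For the delimited-string option on pre-2008 servers, here is a minimal splitter sketch using a recursive CTE. It assumes fixed-width 3-character codes with single-character delimiters, as in the question; a numbers-table or CLR splitter generally scales better:

CREATE FUNCTION dbo.SplitCodes (@list VARCHAR(8000))
RETURNS TABLE
AS
RETURN
(
    -- Walk the string in steps of 4 (3-char code + 1-char delimiter).
    -- Lists longer than 100 items need OPTION (MAXRECURSION 0) on the
    -- calling statement.
    WITH pieces (pos, code) AS
    (
        SELECT 1, SUBSTRING(@list, 1, 3)
        UNION ALL
        SELECT pos + 4, SUBSTRING(@list, pos + 4, 3)
        FROM pieces
        WHERE pos + 4 <= LEN(@list)
    )
    SELECT code FROM pieces
)

The result can then be anti-joined much like a table-valued parameter:

SELECT t.*
FROM dbo.Table1 AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.SplitCodes(@param1) AS s
                  WHERE s.code = t.col)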

Here is an article with a good overview of these techniques and a few more.

VladV
Awesome article link. Thanks!