views:

610

answers:

7

If I have a query like:

Select EmployeeId From Employee Where EmployeeTypeId IN (1,2,3)

and I have an index on the EmployeeTypeId field, does SQL server still use that index?

+3  A: 

Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.

Kibbee
A: 

So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?

lomaxx
A: 

I'd assume that if EmployeeTypeId is part of an index, then the query would use that index. IN is just a comparator.

@Kibbee +1

Terry Lorber
+6  A: 

Yeah, that's right. If your employee table has 10,000 records, and only 5 records have employeetypeID in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the employeeIDType in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIDs, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.

SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.

Select EmployeeId From Employee WITH (Index(Index_EmployeeTypeId )) Where EmployeeTypeId IN (1,2,3)

Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.

Kibbee
+1  A: 

So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?

Whether an index is used doesn't so much vary on the type of query as much of the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.

The other posters are correct that an index will be used over a table scan if:

  • The query won't access more than a certain percent of the rows indexed (say ~10% but should vary between DBMS's).
  • Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.

The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).

As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.

Dana the Sane
+2  A: 

Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each of the list and will use an index if appropriate. In the case of unique IDs and a large enough table then I'd expect the optimiser to use an index.

If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?

It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.

And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...

Mike Woodhouse
A: 

@Mike: Thanks for the detailed analysis. There are definately some interesting points you make there. The example I posted is somewhat trivial but the basis of the question came from using NHibernate.

With NHibernate, you can write a clause like this:

int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
.Add(Restrictions.InG("EmployeeId",employeeIds))

NHibernate then generates a query which looks like

select * from employee where employeeid in (1, 5, 23463, 32523)

So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.

lomaxx