views:

118

answers:

3

I recently encountered an index in a database I maintain that was of the form:

CREATE INDEX [IX_Foo] ON [Foo]
( Id ASC )
INCLUDE 
( SubId )

In this particular case, the performance problem that I was encountering (a slow SELECT filtering on both Id and SubId) could be fixed by simply moving the SubId column into the index proper rather than as an included column.

This got me thinking however that I don't understand the reasoning behind included columns at all, when generally, they could simply be a part of the index itself. Even if I don't particularly care about the items being in the index itself is there any downside to having column in the index rather than simply being included.

After some research, I am aware that there are a number of restrictions on what can go into an indexed column (maximum width of the index, and some column types that can't be indexed like 'image'). In these cases I can see that you would be forced to include the column in the index page data.

The only thing I can think of is that if there are updates on SubId, the row will not need to be relocated if the column is included (though the value in the index would need to be changed). Is there something else that I'm missing?

I'm considering going through the other indexes in the database and shifting included columns in the index proper where possible. Would this be a mistake?

I'm primarily interested in MS SQL Server, but information on other DB engines is welcome also.

+7  A: 

The reason to have an additional column in an index is so that when you do a query that only requires the columns used by the index you can fulfill the query from the index by itself. This way you save some time and resources going back to the table. When this happens, we say the index is a covering index for the query.

The reason you might not want to make this additional column part of the "index proper" is because when you do inserts or updates on that column you're more likely to need to re-sort parts of the index.

Joel Coehoorn
Another interest of INCLUDEing covering columns rather than just adding them to the index (as a key) is that such columns are only added to the leafs of the index, generally making the index smaller and more efficient.
mjv
That's essentially what I was thinking. In my scenario the table that I'm working with is (essentially) readonly, so there isn't really a downside to the extra work that would have to be done on update. @mjv has a point about the size of the index being potentially smaller, by not having to replicate the value of the included columns in non-leaf pages, but I'm not seeing a big enough size difference in my particular scenario to matter. I'm going to go ahead and put the INCLUDEd columns in the index proper to handle the rare case where that column is a part of the search criteria.
StarLite
+2  A: 

Using include in an index allows the index to be used as a covering index (i.e. certain queries can be satisfied using that index alone, without having to perform a bookmark lookup into the clustered index), without adding those columns to the actual tree part of the index, thus keeping the size of the index down. (The included columns are only added to the leaf nodes of the index).

Mitch Wheat
+3  A: 

The answers so far are all correct and all - but they might not convey enough what you gain from a covering index.

In your case, you have a table Foo and some fields, including an Id (which I assume is the primary key), and a SubId which is some additional ID of some kind.

You also have an index IX_Foo which I assume had only Id in it for now.

So now you need to find the SubId for Id=4.

SELECT Id, SubId
FROM Foo
WHERE Id=4
  • SQL Server will look at the SELECT statement and determine it can use IX_Foo
  • it will then go search for the value Id=4 in your index IX_Foo
  • when it finds it, it now needs the value of SubId, too
  • the non-clustered index IX_Foo will contain the clustering key value
  • using that clustering key value, SQL Server will do a "bookmark lookup" to locate the actual data page where your entire data row is located
  • it will fetch that page and extract the value for SubId from it
  • it will return those values to satisfy your query

The main point here is: once SQL Server has found your Id=4 in the IX_Foo index, it will then need to do another I/O operation, a bookmark lookup, to go fetch the whole data row, in order to be able to find the SubId value.

If you have a covering index, e.g. IX_Foo also includes SubId, that extra I/O to do the bookmark lookup is eliminated. Once the value Id=4 is found in the IX_Foo index, that index page in your non-clustered index will also include the value of SubId - SQL Server can now return those two values you asked for in your SELECT query without having to do an extra (potentially expensive and thus slow) bookmark lookup just to go fetch another Id column.

That's the main benefit of covering indices - if you only need one or two extra columns, besides the index values you're doing the lookup on, by including those values into the index itself, you can save yourself a lot of bookmark lookups and thus speed things up significantly. You should however only include very few, and small bits of information - don't duplicate your entire data rows into all non-clustered indices! That's not the point.

UPDATE: the trade-off is this: if you have an index on (Id, SubId), all the pages in the index have both columns - the whole index tree through.

If you INCLUDE(SubId), the SubId fields are only present at the leaf level.

This means

  • SQL Server can't search and compare on SubId (the values are not in the index tree)
  • less space is used since the values are only on the leaf level
marc_s
I understand that part. The real question is about the tradeoff between having an index on Id and SubId vs. and index on Id including SubId.
StarLite
@StarLIte: added an explanation of the trade-off to my post
marc_s