views:

4219

answers:

14

I remember reading at one point that indexing a field with low cardinality (a low number of distinct values) is not really worth doing. I admit I don't know enough about how indexes work to understand why that is.

So what if I have a table with 100 million rows in it, and I am selecting records where a bit field is 1? And let's say that at any point in time, there are only a handful of records where the bit field is 1 (as opposed to 0). Is it worth indexing that bit field or not? Why?

Of course I can just test it and check the execution plan, and I will do that, but I'm also curious about the theory behind it. When does cardinality matter and when does it not?

A: 

Is this a common query? It may be worth it when looking for the "handful" of records but won't help you much on the other rows. Are there other ways to identify the data?

jms
+1  A: 

If you want to know if an index has the effects you desire: test and test again.

In general you don't want an index that doesn't narrow down your table enough, because of the cost to maintain an index. (cost > profit). But if the index in your case will cut the table in half, you may gain something but putting it on the table. It all depends on the exact size/structure of your table and how you are using it (number of reads/writes).

thijs
A: 

Of course it worths, especially if you need to retrieve the data by that value. It would be similar to using a sparse matrix instead of using a normal matrix.

Now with SQL 2008 you can use partitioning functions, and you are able to filter the data that goes in an index. The disadvantage for earlier versions would be that the index would be made for all the data, but this can be optimized by storing the interesting values in a separate file group.

Bogdan Maxim
+1  A: 

As others have said, you'll want to measure this. I don't recall where I've read this, but a column needs to have very high cardinality (around 95%) in order for an index to be effective. Your best test for this would be to build the index and examine the execution plans for the 0 and 1 values of the BIT field. If you see an index seek operation in the execution plan then you know that your index will be used.

Your best course of action would be to test the with a basic SELECT * FROM table WHERE BitField = 1; query and slowly build out the functionality from there step by step until you have a realistic query for your application, examining the execution plan with every step to make sure that the index seek is still being utilized. Admittedly, there is no guarantee that this execution plan will be used in production, but there is a good chance that it will be.

Some information can be found on the sql-server-performance.com forums and in the referenced article

Jeremiah Peschka
It's not such much the cardinality of the column as a whole that matters. It is the selectivity of the WHERE clause. So if there are few columns with value 1, it can still be good to index. If it's 50/50 (e.g. male/female) then not so worth it.
WW
+4  A: 

While I don't think I would index JUST a bit column by itself, it is very common to include bit columns as part of a compound index.

A simple example would be an index on ACTIVE, LASTNAME instead of just lastname, when your application is almost always looking for active customers.

BradC
In the example you gave, I would be more inclined to put LastName first. It depends on the specific query workload, but in general having the more selective column first, means the index is more likely to be used.
Mitch Wheat
+2  A: 

"I remember reading at one point that indexing a field with low cardinality (a low number of distinct values) is not really worth doing"

That because SQL Server will almost always find its more efficient to just do a table-scan than to read the index. So basically your index will never get used and it's a waste to maintain it. As others have said it might be ok in a compound index.

DJ
A: 

Cardinality is one factor, the other is how well does the index divide your data. If you have about half 1s and half 0s, then it will help. (Assuming that that index is a better path to choose than some other index). However, how often are you inserting and updating? Adding indexes for SELECT performance also hurt the INSERT, UPDATE and DELETE performance, so keep that in mind.

I would say, if the 1s to 0s (or vice versa) isn't better than 75% to 25%, don't bother.

Anthony Potts
I would disagree. If your distribution is 50/50, then you would never use the index, as it would just be quicker to do a table scan. However, if you only have 5, 1 values, and 1 million 0 values, it would be very likely to use the index when searching for 1.
Kibbee
A: 

You can't index a bit field in SQL Server 2000, as was indicated in the Books Online at the time:

bit

Integer data type 1, 0, or NULL.

Remarks

Columns of type bit cannot have indexes on them.

Yes, if you have only a handful of rows, out of millions, an index will help. But if you want to do it in this case you need to make the column a tinyint.

Note: Enterprise Manager will not let you create an index on a bit column. If you wish you can still manually create an index on a bit column:

CREATE INDEX IX_Users_IsActiveUsername ON Users
(
   IsActive,
   Username
)

But SQL Server 2000 will not actually use such an index - running a query where the index would be a perfect candidate, e.g.:

SELECT TOP 1 Username 
FROM Users
WHERE IsActive = 0

SQL Server 2000 will do a table scan instead, acting as though the index doesn't even exist. If you change the column to a tinyint SQL Server 2000 will do an index seek. Also, the following non-covered query:

SELECT TOP 1 * 
FROM Users
WHERE IsActive = 0

It will perform an index seek, followed by a bookmark lookup.


SQL Server 2005 does have limited support for indexes on bit columns. For example:

SELECT TOP 1 Username 
FROM Users
WHERE IsActive = 0

will cause an index seek through the covering index. But the non-covered case:

SELECT TOP 1 * 
FROM Users
WHERE IsActive = 0

will not cause an index seek followed by a bookmark lookup, it will perform a table scan (or clustered index scan), rather than performing the index seek followed by a bookmark lookup.

Verified by experimentation and direct observation.

Ian Boyd
FYI - SQL Server 2005 Management Studio does let you do it.
jeremcc
My copy of SQL Server 2000 let me set an index on a bit column.
Kibbee
My copy of SQL Server 2000 doesn't let me set an index on a bit column.
Ian Boyd
+1  A: 

On its own, no as it results in very little selectivity. As part of a compound index. quite possibly but only after other equality columns.

Craig Nicholson
+10  A: 

Consider what an index is in SQL - and index is really a chunk of memory pointing at other chunks of memory (i.e. pointers to rows). The index is broken into pages so that portions of the index can be loaded and unloaded from memory depending on usage.

When you ask for a set of rows, SQL uses the index to find the rows more quickly than table scanning (looking at every row).

SQL has clustered and non-clustered indexes. My understanding of clustered indexes try group similar index values into the same page. This way when you ask for all the rows matching an index value, SQL can return those rows from a clustered page of memory. This is why trying the cluster index a GUID column is a bad idea - you don't try to cluster random values.

When you index and integer column, SQL's index contains a set of rows for each index value. If you have a range of 1 to 10, then you would have 10 index pointers. Depending on how many rows there are this can be paged differently. If your query looks for the index matching "1" and then where Name contains "Fred" (assuming the Name column is not indexed), SQL gets the set of rows matching "1" very quickly, then table scans to find the rest.

So what SQL is really doing is trying to reduce the working set (number of rows) it has to iterate over.

When you index a bit field (or some narrow range), you only reduce the working set by the number of rows matching that value. If you have a small number of rows matching it would reduce your working set a lot. For a large number of rows with 50/50 distribution, it might buy you very little performance gain vs. keeping the index up to date.

The reason everyone says to test is because SQL contains a very clever and complex optimizer that may ignore an index if it decides table scanning is faster, or may use a sort, or may organize memory pages however it darn well likes.

Geoff Cox
So it sounds like if I only ever have a handful of rows where the bit field is 1 (for example keeping track of "IsProcessed"), then an index would be good because it will order them by value and then be able to select the small working set very quickly. If you agree, add that and I will accept it.
jeremcc
What I mean in my previous comment is that this statement:"When you index a bit field (or some narrow range), you only reduce the working set in half"is not true if the distribution is heavily weighted toward one value. But I like the rest of your answer, so if you fix that, I'll accept it.
jeremcc
Done. I was thinking that for a million rows, a bit field would have 50% distribution, but you are right that for a particular problem space it could reduce the working set a lot.
Geoff Cox
It's worthwhile to look at execution plans with and without the index, and see if the index is being used and if it actually reduces the cost of your queries. Easy and scientific!
onupdatecascade
+5  A: 

100-million records with only a few having the bit field set to 1? Yes, I would think indexing the bit field would definitely speed up querying the bit=1 records. You should get logarithmic search time from the index and then only touch the few pages with bit=1 records. Otherwise, you'd have to touch all pages of the 100-million record table.

Then again, I'm definitely not a database expert and could be missing something important.

C. Dragon 76
+2  A: 

If your goal is to make querying for records where the bit field value equals '1' faster you might try an indexed view of your base table which only contains records where your bit field equals '1'. In enterprise edition if a query could make use of an indexed view instead of a specified table to improve query performance it will use the view. In theory this would increase the speed of select queries which only look for records with a bit field value of '1'.

http://www.microsoft.com/technet/prodtechnol/sql/2005/impprfiv.mspx

All this assumes you are Microsoft SQL Server 2005 Enterprise. The same might apply to 2008, I'm not familiar with that version.

It would be nice if someone tested this ...
Kenny Evitt
A: 

In case you haven't read it, Jason Massie wrote an article recently that discussed this very topic.

http://statisticsio.com/Home/tabid/36/articleType/ArticleView/articleId/302/Never-Index-a-BIT.aspx

Jeff
A: 

Ian Boyd is correct when he says that you could not do it via Enterprise Manager for SQL 2000 (see his note regarding creating it throught T-SQL.

John B