views:

41

answers:

3

Do the statistics (which help decide whether an index is to be used) take into account the number of rows per actual column value, or does it just use the average number of rows per value.

Suppose I have a table with an bit column called active which has a million of rows, but with 99.99% set to false. If I have an index on this column, then is Sql smart enough to know to use the index if searching for active=1 but that there is no point if searching for active=0.

Another example, if I have a table which has say 1,000,000 records with a indexed column which contains about 50,000 different values with an average number of rows per value of 10, but then one special value which has 500,000 rows. The index may not be useful if searching for this special record, but would be very useful when looking for any of the other codes.

But does this special case ruin the effectiveness of the index.

+1  A: 

It creates a histogramm and will thus use that.

With a bit column it will have a good idea how many are 0 and 1

With a string column, it iwll have a rough idea of "bands" (value starting a, b, c etc.). Same for numbers (it creates x bands of value ranges).

Just look up how statistics look in your management studio - you can actually access the histograms.

TomTom
+3  A: 

You can see for yourself:

CREATE TABLE IndexTest (
Id int not null primary key identity(1,1),
Active bit not null default(0),
IndexedValue nvarchar(10) not null
)

CREATE INDEX IndexTestActive ON IndexTest (Active)
CREATE INDEX IndexTestIndexedValue ON IndexTest (IndexedValue)

DECLARE @values table
(
    Id int primary key IDENTITY(1, 1),
    Value nvarchar(10)
)

INSERT INTO @values(Value) VALUES ('1')
INSERT INTO @values(Value) VALUES ('2')
INSERT INTO @values(Value) VALUES ('3')
INSERT INTO @values(Value) VALUES ('4')
INSERT INTO @values(Value) VALUES ('5')
INSERT INTO @values(Value) VALUES ('Many')
INSERT INTO @values(Value) VALUES ('Many')
INSERT INTO @values(Value) VALUES ('Many')
INSERT INTO @values(Value) VALUES ('Many')
INSERT INTO @values(Value) VALUES ('Many')

DECLARE @rowCount int
SET @rowCount = 100000

WHILE(@rowCount > 0)
BEGIN
    DECLARE @valueIndex int
    SET @valueIndex = CAST(RAND() * 10 + 1 as int)
    DECLARE @selectedValue nvarchar(10)
    SELECT @selectedValue = Value FROM @values WHERE Id = @valueIndex
    DECLARE @isActive bit
    SELECT @isActive = CASE 
            WHEN RAND() < 0.001 THEN 1 
            ELSE 0
           END
    INSERT INTO IndexTest(Active, IndexedValue) VALUES (@isActive, @selectedValue)
    SET @rowCount = @rowCount - 1
END

SELECT count(*) FROM IndexTest WHERE Active = 1
SELECT count(*) FROM IndexTest WHERE Active = 0

SELECT count(*) FROM IndexTest WHERE IndexedValue = '1'
SELECT count(*) FROM IndexTest WHERE IndexedValue = 'Many'

It looks to me like it always uses the indexes on this query plan:

query plan

codekaizen
Thanks. This helps. If I use where Active=0 and IndexedValue='1' it will do a table scan whereas Active=1 and IndexValue='1' will use the index.
sgmoore
+1  A: 

You can simply look at the statistics and see for yourself :) DBCC SHOW_STATISTICS. See the Remarks section, it has a nice explanation of how the histograms are actually stored and used:

To create the histogram, the query optimizer sorts the column values, computes the number of values that match each distinct column value and then aggregates the column values into a maximum of 200 contiguous histogram steps. Each step includes a range of column values followed by an upper bound column value. The range includes all possible column values between boundary values, excluding the boundary values themselves. The lowest of the sorted column values is the upper boundary value for the first histogram step.

alt text

For each histogram step:

  • Bold line represents the upper boundary value (RANGE_HI_KEY) and the number of times it occurs (EQ_ROWS)
  • Solid area left of RANGE_HI_KEY represents the range of column values and the average number of times each column value occurs (AVG_RANGE_ROWS). The AVG_RANGE_ROWS for the first histogram step is always 0.
  • Dotted lines represent the sampled values used to estimate total number of distinct values in the range (DISTINCT_RANGE_ROWS) and total number of values in the range (RANGE_ROWS). The query optimizer uses RANGE_ROWS and DISTINCT_RANGE_ROWS to compute AVG_RANGE_ROWS and does not store the sampled values.

The query optimizer defines the histogram steps according to their statistical significance. It uses a maximum difference algorithm to minimize the number of steps in the histogram while maximizing the difference between the boundary values. The maximum number of steps is 200. The number of histogram steps can be fewer than the number of distinct values, even for columns with fewer than 200 boundary points. For example, a column with 100 distinct values can have a histogram with fewer than 100 boundary points.

Remus Rusanu