views:

112

answers:

3

Can a select query use different indexes if a change the value of a where condition?

The two following queries use different indexes and the only difference is the value of the condition and typeenvoi='EXPORT' or and typeenvoi='MAIL'

select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
            from v_envoiautomate
            where etat=0 and typeenvoi='EXPORT'
            and nbessais<1 


select numenvoi,adrdest,nomdest,etat,nbessais,numappel,description,typeperiode,datedebut,datefin,codeetat,codecontrat,typeenvoi,dateentree,dateemission,typedoc,numdiffusion,nature,commentaire,criselcomp,crisite,criservice,chrono,codelangueetat,piecejointe, sujetmail, textemail
            from v_envoiautomate
            where etat=0 and typeenvoi='MAIL'
            and nbessais<1

Can anyone give me an explanation?

+2  A: 

Probably has to do with the "cardinality", I believe the word is, of the values in the table. If there are a lot more rows that match that clause, SQL Server may decide that one query will be more efficient using an index for a different column. This is an extreme case, but if there was one row that matched 'MAIL', it would likely use that index. If every other row in the table was 'EXPORT', but only half of those 'EXPORT' rows had an etat of 0, then it would probably use the index on that column.

Lazy Bob
@Lazy Bobo: You are correct in your reasoning however I believe the "word" you are looking for is "selectivity" i.e. the selectivity of a column which is based on the distribution of the data values
John Sansom
+8  A: 

Details on indexes are stored as statistics in a histogram-type dataset in SQL Server.

Each index is chunked into ranges, and each range contains a summary of the key values within that range, things like:

  • range High value
  • number of values in the range
  • number of distinct values in the range (cardinality)
  • number of values equal to the High value

...and so on.

You can view the statistics on a given index with:

DBCC SHOW_STATISTICS(<tablename>, <indexname>)

Each index has a couple of characteristics like density, and ultimately selectivity, that tell the query optimiser how unique each value in an index is likely to be, and how efficient this index is at quickly locating records.

As your query has three columns in the where clause, it's likely that any of these columns might have an index that could be useful to the optimiser. It's also likely that the primary key index will be considered, in the event of the selectivity of other indexes not being high enough.

Ultimately, it boils down to the optimiser making a quick judgement call on how many page reads will be necessary to read each your non-clustered indexes + bookmark lookups, with comparisons with the other values, vs. doing a table scan.

The statistics that these judgements are based on can vary wildly too; SQL Server, by default, only samples a small percentage of any significant table's rows, so the selectivity of that index might not be representative of the whole. This is particularly problematic where you have highly non-unique keys in the index.

In this specific case, I'm guessing your typeenvoi index is highly non-unique. This being so, the statistics gathered probably indicate to the optimiser that one of the values is rarer than the other, and the likelihood of that index being chosen is increased.

Jeremy Smyth
+3  A: 

The query optimiser in SQL Server (as in most modern DBMS platforms) uses a methodology known as 'cost based optimisation.' In order to do this it uses statistics about the tables in the database to estimate the amount of I/O needed. The optimiser will consider a number of semantically equivalent query plans that it generates by transforming a basic query plan generated by parsing the statement.

Each plan is evaluated for cost by a heuristic based on the statistics maintained about the tables. The statistics come in various flavours:

  • Table and index row counts

  • Distributions histograms of the values in individual columns.

If the ocurrence of 'MAIL' vs. 'EXPORT' in the distribution histograms is significantly different the query optimiser can come up with different optimal plans. This is probably what happened.

ConcernedOfTunbridgeWells