views:

265

answers:

1

Say you have a MySQL 5.0 MyISAM table with 100 million rows, with one index (other than primary key) on two integer columns.

From my admittedly poor understanding of B-tree structure, I believe that a lower cardinality means the storage efficiency of the index is better, because there are less parent nodes. Whereas a higher cardinality means less efficient storage, but faster read performance, because it has to navigate through less branches to get to whatever data it is looking for to narrow down the rows for the query.

(Note - by "low" vs "high", I don't mean e.g. 1 million vs 99 million for a 100 million row table. I mean more like 90 million vs 95 million)

Is my understanding correct?

Related question - How does cardinality affect write performance?

+2  A: 

Whereas a higher cardinality means less efficient storage, but faster read performance, because it has to navigate through less branches to get to whatever data it is looking for to narrow down the rows for the query.

Higher cardinality means better read performance because, by definition, there are fewer records to read.

To process a query like this:

SELECT  *
FROM    mytable
WHERE   indexed_col = @myvalue

, the engine should do the following steps:

  1. Find the first entry satisfying the condition.

    This is done traversing the B-Tree, starting from the root entry.

    Across the pages, the search is performed by following B-Tree links; within a page, the search is performed using binary search (unless your keys are compressed, in which case it's a linear search).

    This algorithm same efficiency for both high cardinality and low cardinality columns. Finding the first 3 (as opposed to any 3) in these lists:

    1  2  3  4  5  6  7  8  9  10
    
    
    3  3  3  3  3  3  3  3  4  4
    

    requires same O(log(n)) steps.

  2. Traversing the index until the key value changes. This, of course, requires linear time: the more records you have, the more you need to traverse.

If you only need the first record:

SELECT  *
FROM    mytable
WHERE   indexed_col = @myvalue
LIMIT 1

, the column cardinality does not affect read performance.

How does cardinality affect write performance?

Each index key has a hidden additional value: a record pointer. This is the whole point of having an index: you need to know which record does it point to.

Since a record pointer, by definition, is unique, each index key is unique too. The index entries sharing the same key value are sorted by the record pointer.

This is to make the index maintainable: if you delete a record with a value of an indexed column shared by a million of other records, the corresponding index record should be deleted too. But the whole million of the index records is not being looked through: instead, the record pointer is used as an additional search condition.

Each index key is in fact unique (even if you don't define the index as unique), and, hence, has maximum cardinality possible.

So the answer to your questions is: no, the column cardinality does not affect the index write performance.

Quassnoi
Thank you for the highly detailed answer. My question was related to multi-column indexes, but your examples are for single-column indexes. Does that change anything? Also, storage efficiency is important to me as well. For multi-colum indexes, I was thinking that high cardinality of the first (left) column of the index would mean more storage space, vs having the lower cardinality column on the left. Higher cardinality on the left would mean more parent nodes, correct? Does that affect storage space at all? Thanks again :)
Sean
@Sean: this is also valid for composite indexes. If you have key compression enabled (in `MyISAM`), low cardinality columns can even save you some space (but they imply linear search in the pages, so it's a matter of tradeoff). The number of parent nodes totally depends on the number of records that can fit on a page.
Quassnoi
Great. Thank you!
Sean