MySQL index cardinality - performance vs storage efficiency

Sean picture Sean · Apr 8, 2010 · Viewed 13.4k times · Source

Say you have a MySQL 5.0 MyISAM table with 100 million rows, with one index (other than primary key) on two integer columns.

From my admittedly poor understanding of B-tree structure, I believe that a lower cardinality means the storage efficiency of the index is better, because there are less parent nodes. Whereas a higher cardinality means less efficient storage, but faster read performance, because it has to navigate through less branches to get to whatever data it is looking for to narrow down the rows for the query.

(Note - by "low" vs "high", I don't mean e.g. 1 million vs 99 million for a 100 million row table. I mean more like 90 million vs 95 million)

Is my understanding correct?

Related question - How does cardinality affect write performance?

Answer

Quassnoi picture Quassnoi · Apr 8, 2010

Whereas a higher cardinality means less efficient storage, but faster read performance, because it has to navigate through less branches to get to whatever data it is looking for to narrow down the rows for the query.

Higher cardinality means better read performance because, by definition, there are fewer records to read.

To process a query like this:

SELECT  *
FROM    mytable
WHERE   indexed_col = @myvalue

, the engine should do the following steps:

  1. Find the first entry satisfying the condition.

    This is done traversing the B-Tree, starting from the root entry.

    Across the pages, the search is performed by following B-Tree links; within a page, the search is performed using binary search (unless your keys are compressed, in which case it's a linear search).

    This algorithm same efficiency for both high cardinality and low cardinality columns. Finding the first 3 (as opposed to any 3) in these lists:

    1  2  3  4  5  6  7  8  9  10
    
    3  3  3  3  3  3  3  3  4  4
    

    requires same O(log(n)) steps.

  2. Traversing the index until the key value changes. This, of course, requires linear time: the more records you have, the more you need to traverse.

If you only need the first record:

SELECT  *
FROM    mytable
WHERE   indexed_col = @myvalue
LIMIT 1

, the column cardinality does not affect read performance.

How does cardinality affect write performance?

Each index key has a hidden additional value: a record pointer. This is the whole point of having an index: you need to know which record does it point to.

Since a record pointer, by definition, is unique, each index key is unique too. The index entries sharing the same key value are sorted by the record pointer.

This is to make the index maintainable: if you delete a record with a value of an indexed column shared by a million of other records, the corresponding index record should be deleted too. But the whole million of the index records is not being looked through: instead, the record pointer is used as an additional search condition.

Each index key is in fact unique (even if you don't define the index as unique), and, hence, has maximum cardinality possible.

So the answer to your questions is: no, the column cardinality does not affect the index write performance.