In MySQL, the default index type is a B-tree, and accessing an element in a B-tree takes logarithmic time, O(log(n)).
On the other hand, accessing an element in a hash table takes O(1).
Why is a hash table not used instead of a b-tree in order to access data inside a database?
In a hash table, you can only access elements by their exact key.
This is faster than with a tree algorithm (O(1) instead of O(log(n))), but you cannot select ranges (everything in between x and y).
Tree algorithms support range lookups in O(log(n)), whereas hash indexes can result in a full table scan, O(n).
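To make the range-query difference concrete, here is a minimal Python sketch. It is not how MySQL implements its indexes: a sorted list searched with `bisect` stands in for the ordered tree index, and a plain `dict` stands in for the hash index. The names `keys`, `hash_index`, `range_sorted` and `range_hash` are made up for this illustration.

```python
import bisect

# Stand-in for an ordered (tree-like) index: keys kept in sorted order.
keys = [3, 8, 15, 21, 42, 57, 99]

# Stand-in for a hash index: fast lookup by exact key, but no ordering.
hash_index = {k: f"row-{k}" for k in keys}


def range_sorted(sorted_keys, lo, hi):
    """Return keys in [lo, hi] using two binary searches plus a slice."""
    start = bisect.bisect_left(sorted_keys, lo)
    end = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[start:end]


def range_hash(index, lo, hi):
    """Return keys in [lo, hi]; every key has to be checked, so O(n)."""
    return sorted(k for k in index if lo <= k <= hi)


print(hash_index[42])                   # exact-key lookup
print(range_sorted(keys, 10, 50))       # ordered index jumps to the range
print(range_hash(hash_index, 10, 50))   # hash index scans every key
```

Both range functions return the same keys; the difference is that the ordered version only touches the boundaries and the matching entries, while the hash version has to look at everything.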
Also, the constant overhead of hash indexes is usually bigger (which does not show up in theta notation, but it still exists).
Also, tree algorithms are usually easier to maintain, grow with the data, scale, etc.
Hash indexes work with pre-defined hash sizes, so you end up with some "buckets" in which the objects are stored. The objects in a bucket are then looped over again to find the right one inside this partition.
So if the buckets are small, you have a lot of overhead for few elements, while big buckets result in more scanning.
Today's hash table algorithms usually scale, but scaling can be inefficient.
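The bucket trade-off can be sketched with a toy chained hash index in Python. This is only an illustration under simple assumptions (a fixed bucket count chosen up front, Python's built-in `hash`), and `FixedBucketIndex` is a hypothetical class, not anything from MySQL.

```python
class FixedBucketIndex:
    """Toy hash index with a fixed number of buckets (separate chaining)."""

    def __init__(self, num_buckets):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # The hash only picks the bucket; it does not find the entry itself.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        # Second step: loop over the bucket to find the right entry.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def longest_bucket(self):
        return max(len(b) for b in self.buckets)


few = FixedBucketIndex(num_buckets=4)      # few, large buckets
many = FixedBucketIndex(num_buckets=1024)  # many, small buckets
for i in range(1000):
    few.put(i, f"row-{i}")
    many.put(i, f"row-{i}")

print(few.longest_bucket())   # long chains: each lookup scans many entries
print(many.longest_bucket())  # short chains, but 1024 bucket slots allocated
```

With only 4 buckets every lookup walks a long chain; with 1024 buckets the chains are short, but the table carries many slots regardless of how few rows it holds.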
There are indeed scalable hashing algorithms. Don't ask me how that works; it's a mystery to me too. AFAIK they evolved from scalable replication, where re-hashing is not easy.
It's called RUSH (Replication Under Scalable Hashing), and those algorithms are thus called RUSH algorithms.
However, there may be a point where your index exceeds a tolerable size compared to your hash sizes, and your entire index needs to be rebuilt. Usually this is not a problem, but for huge-huge-huge databases, this can take days.
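Why a rebuild is expensive can be seen in a small Python sketch of a naive resize, assuming a simple chained table where the bucket is chosen by `hash(key) % size`. The `rebuild` function is a hypothetical helper for illustration; real engines use more refined strategies.

```python
def rebuild(buckets, new_size):
    """Rebuild a chained hash index with new_size buckets.

    Changing the bucket count changes where each key belongs, so every
    stored entry is visited and re-hashed: the cost is O(n).
    Returns the new bucket list and the number of entries moved.
    """
    new_buckets = [[] for _ in range(new_size)]
    moved = 0
    for bucket in buckets:
        for key, value in bucket:
            new_buckets[hash(key) % new_size].append((key, value))
            moved += 1
    return new_buckets, moved


# An index that has outgrown its 8 buckets.
old = [[] for _ in range(8)]
for i in range(10_000):
    old[hash(i) % 8].append((i, f"row-{i}"))

new, moved = rebuild(old, 1024)
print(moved)  # one move per entry in the index
```

Every entry is touched once, which is harmless for ten thousand rows but adds up when the index holds billions of them.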
The trade-off for tree algorithms is small, and they are suitable for almost every use case, which is why they are the default.
However, if you have a very precise use case and you know exactly what, and only what, is going to be needed, you can take advantage of hash indexes.