Why does adding an index to a database field speed up searching over that field?

Scott Lemmon picture Scott Lemmon · Sep 27, 2012 · Viewed 9.4k times · Source

I'm new to databases and have been reading that adding an index to a field you need to search over can dramatically speed up search times. I understand this reality, but am curious as to how it actually works. I've searched a bit on the subject, but haven't found any good, concise, and not over technical answer to how it works.

I've read the analogy of it being like an index at the back of a book, but in the case of a data field of unique elements (such as e-mail addresses in a user database), using the back of the book analogy would provide the same linear look up time as a non-indexed seach.

What is going on here to speed up search times so much? I've read a little bit about searching using B+-Trees, but the descriptions were a bit too indepth. What I'm looking for is a high level overview of what is going on, something to help my conceptual understanding of it, not technical details.

Answer

Jarod Elliott picture Jarod Elliott · Sep 27, 2012

Expanding on the search algorithm efficiencies, a key area in database performance is how fast the data can be accessed. In general, reading data from a disk is a lot lot slower than reading data from memory.

To illustrate a point, lets assume everything is stored on disk. If you need to search through every row of data in a table looking for certain values in a field, you still need to read the entire row of data from the disk to see if it matches - this is commonly referred to as a 'table scan'.

If your table is 100MB, that's 100MB you need to read from disk.

If you now index the column you want to search on, in simplistic terms the index will store each unique value of the data and a reference to the exact location of the corresponding full row of data. This index may now only be 10MB compared to 100MB for the entire table.

Reading 10MB of data from the disk (and maybe a bit extra to read the full row data for each match) is roughly 10 times faster than reading the 100MB.

Different databases will store indexes or data in memory in different ways to make these things much faster. However, if your data set is large and doesn't fit in memory then the disk speed can have a huge impact and indexing can show huge gains. In memory there can still be large performance gains (amongst other efficiencies).

In general, that's why you may not notice any tangible difference with indexing a small dataset which easily fits in memory.

The underlying details will vary between systems and actually will be a lot more complicated, but I've always found the disk reads vs. memory reads an easily understandable way of explaining this.