What is Normalisation (or Normalization)?

AJ. picture AJ. · Oct 29, 2008 · Viewed 62.8k times · Source

Why do database guys go on about normalisation?

What is it? How does it help?

Does it apply to anything outside of databases?

Answer

JacquesB picture JacquesB · Oct 29, 2008

Normalization is basically to design a database schema such that duplicate and redundant data is avoided. If some piece of data is duplicated several places in the database, there is the risk that it is updated in one place but not the other, leading to data corruption.

There is a number of normalization levels from 1. normal form through 5. normal form. Each normal form describes how to get rid of some specific problem, usually related to redundancy.

Some typical normalization errors:

(1) Having more than one value in a cell. Example:

UserId | Car
---------------------
1      | Toyota
2      | Ford,Cadillac

Here the "Car" column (which is a string) have several values. That offends the first normal form, which says that each cell should have only one value. We can normalize this problem away by have a separate row per car:

UserId | Car
---------------------
1      | Toyota
2      | Ford
2      | Cadillac

The problem with having several values in one cell is that it is tricky to update, tricky to query against, and you cannot apply indexes, constraints and so on.

(2) Having redundant non-key data (ie. data repeated unnecessarily in several rows). Example:

UserId | UserName | Car
-----------------------
1      | John     | Toyota
2      | Sue      | Ford
2      | Sue      | Cadillac

This design is a problem because the name is repeated per each column, even though the name is always determined by the UserId. This makes it theoretically possible to change the name of Sue in one row and not the other, which is data corruption. The problem is solved by splitting the table in two, and creating a primary key/foreign key relationship:

UserId(FK) | Car               UserId(PK) | UserName
---------------------          -----------------
1          | Toyota            1          | John
2          | Ford              2          | Sue
2          | Cadillac

Now it may seem like we still have redundant data because the UserId's are repeated; However the PK/FK constraint ensures that the values cannot be updated independently, so integrity is safe.

Is it important? Yes, it is very important. By having a database with normalization errors, you open the risk of getting invalid or corrupt data into the database. Since data "lives forever" it is very hard to get rid of corrupt data when first it has entered the database.

Don't be scared of normalization. The official technical definitions of the normalization levels are quite obtuse. It makes it sound like normalization is a complicated mathematical process. However, normalization is basically just the common sense, and you will find that if you design a database schema using common sense it will typically be fully normalized.

There are a number of misconceptions around normalization:

  • some believe that normalized databases are slower, and the denormalization improves performance. This is only true in very special cases however. Typically a normalized database is also the fastest.

  • sometimes normalization is described as a gradual design process and you have to decide "when to stop". But actually the normalization levels just describe different specific problems. The problem solved by normal forms above 3rd NF are pretty rare problems in the first place, so chances are that your schema is already in 5NF.

Does it apply to anything outside of databases? Not directly, no. The principles of normalization is quite specific for relational databases. However the general underlying theme - that you shouldn't have duplicate data if the different instances can get out of sync - can be applied broadly. This is basically the DRY principle.