What are the design criteria for primary keys?

bbadour picture bbadour · Sep 3, 2010 · Viewed 13.8k times · Source

Choosing good primary keys, candidate keys and the foreign keys that use them is a vitally important database design task -- as much art as science. The design task has very specific design criteria.

What are the criteria?

Answer

OMG Ponies picture OMG Ponies · Sep 3, 2010

What is a Primary Key?

The primary key is something that uniquely identifies a row/record of data. It can also be multiple columns, which is called a composite.

Ability to Change

Because the primary key is often used for foreign references, it should be as stable as possible. All data in the database is mutable, providing someone is connecting with an account that has appropriate privileges. This is why databases provide the ability to define CASCADE ON DELETE and CASCADE ON UPDATE--to sync referential dependencies without having to disable constraints.

Natural or Artifical/Surrogate?

Ideally, you want a natural key. A natural key is existing data that uniquely identifies the entity you are modeling. For example, the abbreviations of US states is a good natural key because the abbreviation is consistent and everyone knows them:

US_STATE_PRIMARY_KEY  US_STATE
-------------------------- 
AL                    Alabama
AK                    Alaska
AZ                    Arizona
AR                    Arkansas
CA                    California

Don't try too hard to find a natural key. They seldom exist. It's unlikely that a US State name would change, but it is plausible.

Realistically, primary keys will typically be artificial (often generated by database functionality). These are typically numbers or GUIDs, and they're considered artificial because on their own - there's nothing to relate their value to the information they uniquely identify. A sales receipt is always numbered, because there's nothing natural about it and it's also for auditing - gaps in the receipt numbers raise suspicions. To demonstrate how arbitrary numbering is, here's the US state table but using an integer for the primary key column, US_STATE_CODE:

US_STATE_PRIMARY_KEY  US_STATE
-------------------------- 
100                   Alabama
101                   Alaska
102                   Arizona
103                   Arkansas
104                   California

There's no requirement to start the value at one; some shops use this as a security measure to thwart SQL injection. The value is sequential based on the alphabetic ordering of the State name, but that can't be guaranteed. But unlike the natural key, if the state name changed - only one column would have to be updated.

Single Column vs Composite

Ideally one column will be the primary key, but make the decision based on the data at hand--do not combine columns just for the sake of having a single column. If you do shoehorn data together, use a character to separate the data easily (though operations to do this won't be able to take advantage of an index if present).

Performance

From a performance perspective, integers are best because they offer a decent range of values and the number of bytes used is small when you compare to VARCHAR of five or more characters.