Decision between storing lookup table IDs or pure data

cweston · Dec 20, 2008 · Viewed 7.6k times

I find this comes up a lot, and I'm not sure of the best way to approach it.

The question I have is how to decide between using foreign keys to lookup tables and storing the lookup table values directly in the tables that need them, avoiding the lookup table relationship completely.

Points to keep in mind:

  • With the second method you would need to do mass updates to all records referencing the data if it is changed in the lookup table.

  • This is focused more on tables that have many columns referencing many lookup tables. Lots of foreign keys means a lot of joins every time you query the table.

  • This data would be coming from drop down lists which would be pulled from the lookup tables. In order to match up data when reloading, the values need to be in the existing list (related to the first point).

Is there a best practice here, or any key points to consider?

Answer

Bill Karwin · Dec 20, 2008

You can use a lookup table with a VARCHAR primary key, and your main data table uses a FOREIGN KEY on its column, with cascading updates.

CREATE TABLE ColorLookup (
  color VARCHAR(20) PRIMARY KEY
);

CREATE TABLE ItemsWithColors (
  ...other columns...,
  color VARCHAR(20),
  FOREIGN KEY (color) REFERENCES ColorLookup(color)
    ON UPDATE CASCADE ON DELETE SET NULL
);

This solution has the following advantages:

  • You can query the color names in the main data table without requiring a join to the lookup table.
  • Nevertheless, color names are constrained to the set of colors in the lookup table.
  • You can get a list of unique color names (even if none are currently in use in the main data) by querying the lookup table.
  • If you change a color in the lookup table, the change automatically cascades to all referencing rows in the main data table.
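To see these four points in action, here is a minimal sketch that runs the schema above against an in-memory SQLite database from Python. SQLite is only a convenient stand-in here (the answer does not name a database for this part), and `item_id` is a hypothetical column standing in for the elided "other columns". Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`; other databases may differ in such details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite ignores foreign key constraints unless this is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE ColorLookup (
  color VARCHAR(20) PRIMARY KEY
);
CREATE TABLE ItemsWithColors (
  item_id INTEGER PRIMARY KEY,  -- hypothetical; stands in for the elided columns
  color VARCHAR(20),
  FOREIGN KEY (color) REFERENCES ColorLookup(color)
    ON UPDATE CASCADE ON DELETE SET NULL
);
INSERT INTO ColorLookup (color) VALUES ('red'), ('blue');
INSERT INTO ItemsWithColors (item_id, color) VALUES (1, 'red'), (2, 'blue');
""")


def colors_by_item(connection):
    """Read color names straight from the main table, with no join."""
    query = "SELECT item_id, color FROM ItemsWithColors ORDER BY item_id"
    return connection.execute(query).fetchall()


before = colors_by_item(conn)

# A color that is not in the lookup table is rejected by the foreign key.
try:
    conn.execute("INSERT INTO ItemsWithColors (item_id, color) VALUES (3, 'plaid')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Renaming a color in the lookup table cascades to the referencing rows.
conn.execute("UPDATE ColorLookup SET color = 'crimson' WHERE color = 'red'")
after_rename = colors_by_item(conn)

# Deleting a color sets the referencing rows to NULL (None in Python).
conn.execute("DELETE FROM ColorLookup WHERE color = 'blue'")
after_delete = colors_by_item(conn)

print(before, rejected, after_rename, after_delete)
```

The drop-down list from the question would then be populated with a plain `SELECT color FROM ColorLookup`, and no mass update is needed when a name changes.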

It's surprising to me that so many other people on this thread seem to have mistaken ideas of what "normalization" is. Using surrogate keys (the ubiquitous "id") has nothing to do with normalization!


Re comment from @MacGruber:

Yes, the size is a factor. In InnoDB for example, every secondary index stores the primary key value of the row(s) where a given index value occurs. So the more secondary indexes you have, the greater the overhead for using a "bulky" data type for the primary key.

Also this affects foreign keys; the foreign key column must be the same data type as the primary key it references. You might have a small lookup table so you think the primary key size in a 50-row table doesn't matter. But that lookup table might be referenced by millions or billions of rows in other tables!
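A back-of-the-envelope calculation makes the size argument concrete. The sketch below is only a rough model with made-up numbers: it assumes a 4-byte INT surrogate key versus a VARCHAR(20) key averaging 8 single-byte characters plus a 1-byte length prefix, and it ignores row headers, page overhead, and everything else a real storage engine adds. The function names and row counts are hypothetical.

```python
INT_KEY_BYTES = 4          # MySQL INT
VARCHAR_KEY_BYTES = 8 + 1  # assumed average of 8 characters plus a 1-byte length prefix


def pk_bytes_in_secondary_indexes(rows, pk_bytes, secondary_indexes):
    """Bytes spent repeating the primary key value in every secondary index entry."""
    return rows * pk_bytes * secondary_indexes


def fk_column_bytes(referencing_rows, key_bytes):
    """Bytes spent storing the key value in a referencing foreign key column."""
    return referencing_rows * key_bytes


# Hypothetical table: 10 million rows, 3 secondary indexes.
int_index_cost = pk_bytes_in_secondary_indexes(10_000_000, INT_KEY_BYTES, 3)
varchar_index_cost = pk_bytes_in_secondary_indexes(10_000_000, VARCHAR_KEY_BYTES, 3)

# Hypothetical lookup table referenced by 100 million rows elsewhere.
int_fk_cost = fk_column_bytes(100_000_000, INT_KEY_BYTES)
varchar_fk_cost = fk_column_bytes(100_000_000, VARCHAR_KEY_BYTES)

print(f"secondary index copies: INT {int_index_cost:,} B, VARCHAR {varchar_index_cost:,} B")
print(f"foreign key columns:    INT {int_fk_cost:,} B, VARCHAR {varchar_fk_cost:,} B")
```

Under these assumptions the wider key costs a bit more than twice as much in both places, which may or may not matter compared with the joins it saves.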

There's no right answer for all cases. Any answer can be correct for different cases. You just learn about the tradeoffs, and try to make an informed decision on a case-by-case basis.