Versioning in SQL Tables - how to handle it?

corsiKa picture corsiKa · Sep 22, 2010 · Viewed 29.3k times · Source

Here's a fictional scenario with some populated data. For tax purposes, my fictional company must retain records of historical data. For this reason, I've included a version column to the table.

TABLE EMPLOYEE: (with personal commentary)

|ID | VERSION | NAME       | Position | PAY |
+---+---------+------------+----------+-----+
| 1 |    1    | John Doe   | Owner    | 100 | Started company
| 1 |    2    | John Doe   | Owner    |  80 | Pay cut to hire a coder
| 2 |    1    | Mark May   | Coder    |  20 | Hire said coder
| 2 |    2    | Mark May   | Coder    |  30 | Productive coder gets raise
| 3 |    1    | Jane Field | Admn Asst|  15 | Need office staff
| 2 |    3    | Mark May   | Coder    |  35 | Productive coder gets raise
| 1 |    3    | John Doe   | Owner    | 120 | Sales = profit for owner!
| 3 |    2    | Jane Field | Admn Asst|  20 | Raise for office staff
| 4 |    1    | Cody Munn  | Coder    |  20 | Hire another coder
| 4 |    2    | Cody Munn  | Coder    |  25 | Give that coder raise
| 3 |    3    | Jane Munn  | Admn Asst|  20 | Jane marries Cody <3
| 2 |    4    | Mark May   | Dev Lead |  40 | Promote mark to Dev Lead
| 4 |    3    | Cody Munn  | Coder    |  30 | Give Cody a raise
| 2 |    5    | Mark May   | Retired  |   0 | Mark retires
| 5 |    1    | Joey Trib  | Dev Lead |  40 | Bring outside help for Dev Lead
| 6 |    1    | Hire Meplz | Coder    |  10 | Hire a cheap coder
| 3 |    4    | Jane Munn  | Retired  |   0 | Jane quits
| 7 |    1    | Work Fofre | Admn Asst|  10 | Hire Janes replacement
| 8 |    1    | Fran Hesky | Coder    |  10 | Hire another coder
| 9 |    1    | Deby Olav  | Coder    |  25 | Hire another coder
| 4 |    4    | Cody Munn  | VP Ops   |  80 | Promote Cody
| 9 |    2    | Deby Olav  | VP Ops   |  80 | Cody fails at VP Ops, promote Deby
| 4 |    5    | Cody Munn  | Retired  |   0 | Cody retires in shame
| 5 |    2    | Joey Trib  | Dev Lead |  50 | Give Joey a raise
+---+---------+------------+----------+-----+

Now, if I wanted to do something like "Get a list of the current coders" I couldn't just do SELECT * FROM EMPLOYEE WHERE Position = 'Coder' because that would return lots of historical data... which is bad.

I'm looking for good ideas to handle this scenario. I see a few options that jump out at me, but I'm sure someone's going to say "Wow, that's a rookie mistake, glow... try this on for size:" which is what this place is all about, right? :-)

Idea number 1: Keep a version table with the current version like this

TABLE EMPLOYEE_VERSION:

|ID |VERSION|
+---+-------+
| 1 |   3   |
| 2 |   5   |
| 3 |   4   |
| 4 |   6   |
| 5 |   2   |
| 6 |   1   |
| 7 |   1   |
| 8 |   1   |
| 9 |   2   |     
+---+-------+

Although I'm not sure how I'd do that with a single query, I'm sure it could be done, and I bet I could figure it out with a rather small amount of effort.

Of course, I would have to update this table every time I insert into the EMPLOYEE table to increment the version for the given ID (or insert into the version table when a new id is made).

The overhead of that seems undesireable.

Idea number 2: Keep an archive table and a main table. Before updating the main table, insert the row I'm about to overwrite into archive table, and use the main table as I normally would as if I wasn't concerned about versioning.

Idea number 3: Find a query that adds something along the lines of SELECT * FROM EMPLOYEE WHERE Position = 'Coder' and version=MaxVersionForId(EMPLOYEE.ID)... Not entirely sure how I'd do this. This seems the best idea to me, but I'm really not sure at this point.

Idea number 4: Make a column for "current" and add "WHERE current = true AND ..."

It occurs to me that surely people have done this before, run into these same problems, and have insight on it to share, and so I come to collect that! :) I've tried to find examples of the problem on here already, but they seems specialized to a particular scenario.

Thanks!

EDIT 1:

Firstly, I appreciate all answers, and you've all said the same thing - DATE is better than VERSION NUMBER. One reason I was going with VERSION NUMBER was to simplify the process of updating in the server to prevent the following scenario

Person A loads employee record 3 in his session, and it has version 4. Person B loads employee record 3 in his session, and it has version 4. Person A makes changes and commits. This works because the most recent version in the database is 4. It is now 5. Person B makes changes and commits. This fails because the most recent version is 5, while his is 4.

How would the EFFECTIVE DATE pattern address this issue?

EDIT 2:

I think I could do it by doing something like this: Person A loads employee record 3 in his session, and it's effective date is 1-1-2010, 1:00 pm, with no experation. Person B loads employee record 3 in his session, and it's effective date is 1-1-2010, 1:00 pm, with no experation. Person A makes changes and commits. The old copy goes to the archive table (basically idea 2) with an experation date of 9/22/2010 1:00 pm. The updated version of the main table has an effective date of 9/22/2010 1:00 pm. Person B makes changes and commits. The commit fails because the effective dates (in the database and session) don't match.

Answer

NotMe picture NotMe · Sep 22, 2010

I think you've started down the wrong path.

Typically, for versioning or storing historical data you do one of two (or both) things.

  1. You have a separate table that mimics the original table + a date/time column for the date it was changed. Whenever a record is updated, you insert the existing contents into the history table just prior to the update.

  2. You have a separate warehouse database. In this case you can either version it just like in #1 above OR you simply snapshot it once every so often (hourly, daily, weekly..)

Keeping your version number in the same table as your normal one has several problems. First, the table size is going to grow like crazy. This will put constant pressure on normal production queries.

Second, it's going to radically increase your query complexity for joins etc in order to make sure the latest version of each record is being used.