How to Store Historical Data

Aaron · Oct 6, 2010 · Viewed 126.1k times

Some co-workers and I got into a debate on the best way to store historical data. Currently, for some systems, I use a separate table to store historical data and keep the original table for the current, active record. So, let's say I have table FOO. Under my system, all active records go in FOO, and all historical records go in FOO_Hist. Users can update many different fields in FOO, so I want to keep an accurate account of everything that changes. FOO_Hist holds the exact same fields as FOO, plus an auto-incrementing HIST_ID. Every time FOO is updated, I perform an insert into FOO_Hist similar to: insert into FOO_Hist select * from FOO where id = @id.
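For readers who want something concrete, here is a minimal sketch of the separate-history-table pattern described above. It uses Python's built-in sqlite3 module as a stand-in for whatever DBMS is in play. The columns name and status, and the helper update_foo, are made up for illustration. Copying the row before the update (rather than after) is also an assumption, since the question does not say which is done.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT, status TEXT);
    -- Same columns as foo, plus an auto-incrementing surrogate key.
    CREATE TABLE foo_hist (
        hist_id INTEGER PRIMARY KEY AUTOINCREMENT,
        id      INTEGER,
        name    TEXT,
        status  TEXT
    );
""")

def update_foo(conn, foo_id, name, status):
    """Copy the current row into foo_hist, then apply the update.

    Both statements run in one transaction, so the history row and the
    update are committed (or rolled back) together.
    """
    with conn:  # commits on success, rolls back on an exception
        conn.execute(
            "INSERT INTO foo_hist (id, name, status) "
            "SELECT id, name, status FROM foo WHERE id = ?",
            (foo_id,),
        )
        conn.execute(
            "UPDATE foo SET name = ?, status = ? WHERE id = ?",
            (name, status, foo_id),
        )

# Hypothetical usage: one active row, updated once.
conn.execute("INSERT INTO foo VALUES (1, 'widget', 'new')")
update_foo(conn, 1, "widget", "shipped")
```

Listing the columns explicitly instead of using select * keeps the copy working if the two tables ever drift apart in column order.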

My co-worker says that this is bad design because I shouldn't have an exact copy of a table for historical reasons and should just insert another record into the active table with a flag indicating that it's for historical purposes.

Is there a standard for dealing with historical data storage? It seems to me that I don't want to clutter my active records with all of my historical records in the same table considering that it may be well over a million records (I'm thinking long term).
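For comparison, the co-worker's single-table alternative might look roughly like the sketch below, again using Python's sqlite3 as a stand-in DBMS. The is_history column, the surrogate row_id, the foo_current view, and the partial index are all assumptions added for illustration. Partial (filtered) indexes are not available in every DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (
        row_id     INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        id         INTEGER NOT NULL,  -- business key, repeated per version
        name       TEXT,
        is_history INTEGER NOT NULL DEFAULT 0
    );
    -- Readers that only want active records go through this view,
    -- so they do not have to remember the flag.
    CREATE VIEW foo_current AS
        SELECT id, name FROM foo WHERE is_history = 0;
    -- Indexes only the active rows and allows at most one per id.
    CREATE UNIQUE INDEX foo_one_active ON foo (id) WHERE is_history = 0;
""")

def update_foo(conn, foo_id, name):
    """Insert a flagged copy of the active row, then update the active row."""
    with conn:  # one transaction for both statements
        conn.execute(
            "INSERT INTO foo (id, name, is_history) "
            "SELECT id, name, 1 FROM foo WHERE id = ? AND is_history = 0",
            (foo_id,),
        )
        conn.execute(
            "UPDATE foo SET name = ? WHERE id = ? AND is_history = 0",
            (name, foo_id),
        )

# Hypothetical usage: one record, updated twice.
conn.execute("INSERT INTO foo (id, name) VALUES (1, 'widget')")
update_foo(conn, 1, "gadget")
update_foo(conn, 1, "gizmo")
```

The trade-off is visible here: every query against foo has to filter on the flag or go through the view, and the table grows with every update.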

How do you or your company handle this?

I'm using MS SQL Server 2008, but I'd like to keep the answer generic and independent of any particular DBMS.

Answer

ConcernedOfTunbridgeWells · Oct 6, 2010

Supporting historical data directly within an operational system will make your application much more complex than it would otherwise be. Generally, I would not recommend doing it unless you have a hard requirement to manipulate historical versions of a record within the system.

If you look closely, most requirements for historical data fall into one of two categories:

  • Audit logging: This is better off done with audit tables. It's fairly easy to write a tool that generates scripts to create audit log tables and triggers by reading metadata from the system data dictionary. This type of tool can be used to retrofit audit logging onto most systems. You can also use this subsystem for change data capture if you want to implement a data warehouse (see below).

  • Historical reporting: Reporting on historical state, 'as-at' positions or analytical reporting over time. It may be possible to fulfil simple historical reporting requirements by querying audit logging tables of the sort described above. If you have more complex requirements then it may be more economical to implement a data mart for the reporting than to try to integrate history directly into the operational system.

    Slowly changing dimensions are by far the simplest mechanism for tracking and querying historical state and much of the history tracking can be automated. Generic handlers aren't that hard to write. Generally, historical reporting does not have to use up-to-the-minute data, so a batched refresh mechanism is normally fine. This keeps your core and reporting system architecture relatively simple.
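To make the slowly changing dimension idea concrete, here is a minimal sketch of a type 2 dimension with a batched refresh and an 'as-at' lookup, using Python's sqlite3 as a stand-in for the reporting database. The table dim_foo, its columns, the '9999-12-31' open-ended marker, and both function names are assumptions for illustration, not a prescribed design.

```python
import sqlite3

OPEN_END = "9999-12-31"  # marks the version that is still current

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_foo (
        foo_key    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        foo_id     INTEGER NOT NULL,   -- key in the operational system
        status     TEXT,
        valid_from TEXT NOT NULL,      -- ISO dates compare correctly as text
        valid_to   TEXT NOT NULL
    );
""")

def refresh_dimension(conn, source_rows, load_date):
    """Batched refresh: source_rows is an iterable of (foo_id, status)."""
    with conn:
        for foo_id, status in source_rows:
            current = conn.execute(
                "SELECT foo_key, status FROM dim_foo "
                "WHERE foo_id = ? AND valid_to = ?",
                (foo_id, OPEN_END),
            ).fetchone()
            if current is not None and current[1] == status:
                continue  # unchanged since the last load, nothing to do
            if current is not None:
                # Close the old version at the load date.
                conn.execute(
                    "UPDATE dim_foo SET valid_to = ? WHERE foo_key = ?",
                    (load_date, current[0]),
                )
            # Open a new version starting at the load date.
            conn.execute(
                "INSERT INTO dim_foo (foo_id, status, valid_from, valid_to) "
                "VALUES (?, ?, ?, ?)",
                (foo_id, status, load_date, OPEN_END),
            )

def status_as_at(conn, foo_id, as_at):
    """Return the status that was in effect on the given date, or None."""
    row = conn.execute(
        "SELECT status FROM dim_foo "
        "WHERE foo_id = ? AND valid_from <= ? AND ? < valid_to",
        (foo_id, as_at, as_at),
    ).fetchone()
    return row[0] if row else None

# Hypothetical nightly loads; the third one sees no change.
refresh_dimension(conn, [(1, "new")], "2010-01-01")
refresh_dimension(conn, [(1, "shipped")], "2010-02-01")
refresh_dimension(conn, [(1, "shipped")], "2010-03-01")
```

Because the refresh works from a snapshot of the source rows, it can run on a schedule against the reporting database without touching the operational schema.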

If your requirements fall into one of these two categories, you are probably better off not storing historical data in your operational system. Separating the historical functionality into another subsystem will probably be less effort overall and produce transactional and audit/reporting databases that work much better for their intended purpose.
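As an illustration of the audit logging option, the sketch below generates an audit table and triggers from column metadata. It uses Python's sqlite3, where PRAGMA table_info plays the role of the system data dictionary (on other systems this would be something like INFORMATION_SCHEMA.COLUMNS). The function audit_ddl, the _audit naming, and the choice to log the old row image on update and delete are all assumptions. Table names are interpolated directly into the SQL, so this is only suitable for trusted input.

```python
import sqlite3

def audit_ddl(conn, table):
    """Generate DDL for an audit table and triggers from column metadata."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    cols = [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]
    if not cols:
        raise ValueError(f"no such table: {table}")
    col_defs = ", ".join(f"{name} {ctype}" for name, ctype in cols)
    col_names = ", ".join(name for name, _ in cols)
    old_values = ", ".join(f"OLD.{name}" for name, _ in cols)
    statements = [
        f"CREATE TABLE {table}_audit ("
        f"audit_id INTEGER PRIMARY KEY AUTOINCREMENT, "
        f"audit_action TEXT NOT NULL, "
        f"audit_time TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP, "
        f"{col_defs});"
    ]
    # One trigger per action; each records the row as it was before the change.
    for action in ("UPDATE", "DELETE"):
        statements.append(
            f"CREATE TRIGGER {table}_audit_{action.lower()} "
            f"AFTER {action} ON {table} BEGIN "
            f"INSERT INTO {table}_audit (audit_action, {col_names}) "
            f"VALUES ('{action}', {old_values}); END;"
        )
    return "\n".join(statements)

# Hypothetical usage: retrofit auditing onto an existing table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")
conn.executescript(audit_ddl(conn, "foo"))
conn.execute("INSERT INTO foo VALUES (1, 'widget', 'new')")
conn.execute("UPDATE foo SET status = 'shipped' WHERE id = 1")
conn.execute("DELETE FROM foo WHERE id = 1")
```

The application code never mentions the audit table. The triggers do the logging, which is what makes this approach easy to retrofit.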