Database Content Versioning

user800576 · Jun 16, 2011

I am interested in keeping a running history of every change made to some tables in my database, so that I can reconstruct historical states of the database for analysis purposes.

I am using Postgres, and this MVCC thing seems like something I should be able to exploit for this purpose, but I cannot find any documentation to support this. Can I do it? Is there a better way?

Any input is appreciated!

UPD

I have marked Denis' response as the answer, because he did in fact answer the question of whether MVCC is what I want. However, the strategy I have settled on is detailed below in case anyone finds it useful:

The Postgres feature that does what I want: online backup/point-in-time recovery.

http://www.postgresql.org/docs/8.1/static/backup-online.html explains how to use this feature. Essentially, you enable archive mode for the write-ahead log (WAL), take a base backup of the database (say, before it goes live), then continually archive the WAL segments. You can then use log replay to recover the state of the database at any point in time, with the side benefit of having a warm standby if you choose (by continually replaying the new WAL segments on your standby server).
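For reference, a minimal sketch of the relevant postgresql.conf settings on a modern PostgreSQL release (the 8.1-era docs linked above use slightly different parameter names, and the archive path here is just a placeholder):

wal_level = replica                            # keep enough WAL detail for archiving
archive_mode = on                              # enable continuous archiving
archive_command = 'cp %p /mnt/wal_archive/%f'  # copy each completed WAL segment

After setting these, you take a base backup (e.g. with pg_basebackup) and let the archiver preserve every WAL segment from then on.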

Perhaps this method is not as elegant as other ways of keeping a history, since you need to actually restore the database for every point in time you wish to query; however, it looks extremely easy to set up and loses zero information. That means that when I have the time to improve my handling of historical data, I'll have everything, and will therefore be able to transform my clunky system into a more elegant one.

One key fact that makes this so perfect is that my "valid time" is the same as my "transaction time" for this specific application; if this were not the case, I would only be capturing "transaction time".

Before I found out about the WAL, I was considering just taking daily snapshots or something, but the large storage requirement and the data loss involved did not sit well with me.

For a quick way to get up and running without compromising my data retention from the outset, this seems like the perfect solution.

Answer

Craig Ringer · Sep 27, 2012

Time Travel

PostgreSQL used to have just this feature, and called it "Time Travel". See the old documentation.

There's somewhat similar functionality in the spi contrib module that you might want to check out.

Composite type audit trigger

What I usually do instead is use triggers to log changes, along with timestamps, to archival tables, and query against those. If the table structure isn't going to change, you can use something like:

CREATE TABLE sometable_history(
    command_tag text not null check (command_tag IN ('INSERT','DELETE','UPDATE','TRUNCATE')),
    new_content sometable,                  -- whole row stored as a composite value
    change_time timestamp with time zone
);

and your versioning trigger can just insert (TG_OP, NEW, current_timestamp) into sometable_history, with a different case for DELETE, where NEW is not defined.
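A minimal sketch of such a trigger, assuming the sometable_history definition above (the function and trigger names here are illustrative, not from the original answer):

CREATE OR REPLACE FUNCTION sometable_versioning() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        -- NEW is not defined for DELETE, so archive the old row instead
        INSERT INTO sometable_history(command_tag, new_content, change_time)
        VALUES (TG_OP, OLD, current_timestamp);
        RETURN OLD;
    ELSE
        INSERT INTO sometable_history(command_tag, new_content, change_time)
        VALUES (TG_OP, NEW, current_timestamp);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sometable_versioning_trg
    AFTER INSERT OR UPDATE OR DELETE ON sometable
    FOR EACH ROW EXECUTE PROCEDURE sometable_versioning();

Note that TRUNCATE does not fire row-level triggers; capturing it would need a separate statement-level ON TRUNCATE trigger.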

hstore audit trigger

That gets painful if the schema changes to add new NOT NULL columns, though. If you expect to do anything like that, consider using hstore to archive the columns instead of a composite type. I've added an implementation of that on the PostgreSQL wiki.
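As a rough sketch of the idea (this is not the wiki implementation itself; the table and column names are made up here), the history table stores key/value pairs rather than a composite:

CREATE EXTENSION hstore;

CREATE TABLE sometable_history_kv(
    command_tag text not null,
    row_data hstore,                        -- column-name/value pairs
    change_time timestamp with time zone
);

-- in the trigger body, hstore(NEW) turns the row into key/value pairs:
-- INSERT INTO sometable_history_kv VALUES (TG_OP, hstore(NEW), current_timestamp);

Because new columns simply show up as extra keys, adding a NOT NULL column to sometable doesn't break the history table.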

PITR

If you want to avoid impact on your master database (growing tables, etc.), you can alternatively use continuous archiving and point-in-time recovery to log WAL files that can, using a recovery.conf, be replayed to any moment in time. Note that WAL files are big, and they include not only the tuples you changed but also VACUUM activity and other details. You'll want to run them through clearxlogtail, since they can have garbage data on the end if they're partial segments from an archive timeout, and then you'll want to compress them heavily for long-term storage.
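For reference, a minimal recovery.conf sketch that replays the archived WAL up to a chosen moment (the path and target timestamp are placeholders; on PostgreSQL 12 and later these settings moved into postgresql.conf):

restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2012-09-01 12:00:00'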