views:

1240

answers:

4

I've read a few questions on SO (such as this one) in regards to versioning your data within a database.

I liked some of the suggestions that were mentioned. I have for the longest time wanted (needed) to revision many of my tables but never got around to it. Being a programmer with only simple database work under my belt I was wondering how one would actually go about doing this.

I'm not asking for the actual solution in SQL syntax. I can eventually figure that out for myself (or post SO when the time comes). I'm just asking for people to comment as how they would go about doing it and any potential performance problems there might be if I was to 'revision' hundreds of million of records. Or any other suggestions as long as it is based on the example below.

Given a simple example:

Person
------------------------------------------------
ID                UINT NOT NULL,
PersonID          UINT NOT NULL,
Name              VARCHAR(200) NOT NULL,
DOB               DATE NOT NULL,
Email             VARCHAR(100) NOT NULL

Audit
------------------------------------------------
ID                UINT NOT NULL,
UserID            UINT NOT NULL,               -- Who
TableName         VARCHAR(50) NOT NULL,        -- What
OldRecID          UINT NOT NULL,               -- Where
NewRecID          UINT NOT NULL,
AffectedOn        DATE NOT NULL,               -- When
Comment           VARCHAR(500) NOT NULL        -- Why

I'm not sure how one would link the Audit table to any other tables (such as Person) if the TableName is a string?

Also, assuming that I have three GUI's to populate:

  1. A full record for a specific person id
  2. A table view listing all persons (by id)
  3. A view showing each person with their revision info below each entry (# of revisions per person, dates of revisions, revision comments, etc), ordered by the most recent revisions.

To accomplish 1 and 2, would it be better to query the Person table or the Audit table?

To accomplish 3, would a so called database expert simply get all records and pass it on to the software for processing, or group by PersonID and Affected date? Is this usually handled in one query or many?

+7  A: 

I have done various audit schemes over the years and I am currently going to implement something like this:

Person
------------------------------------------------
ID                UINT NOT NULL,
PersonID          UINT NOT NULL,
Name              VARCHAR(200) NOT NULL,
DOB               DATE NOT NULL,
Email             VARCHAR(100) NOT NULL


Person_History
------------------------------------------------
ID                UINT NOT NULL,
PersonID          UINT NOT NULL,
Name              VARCHAR(200) NOT NULL,
DOB               DATE NOT NULL,
Email             VARCHAR(100) NOT NULL
AuditID           UINT NOT NULL


Audit
------------------------------------------------
ID                UINT NOT NULL,
UserID            UINT NOT NULL,               -- Who
AffectedOn        DATE NOT NULL,               -- When
Comment           VARCHAR(500) NOT NULL        -- Why

The current records are always in the Person table. If there is a change an audit record is created and the old record is copied into the Person_History table (note the ID does not change and there can be multiple versions)

The Audit ID is in the *_History tables so you can link multiple record changes to one audit record if you like.

EDIT:
If you don't have a separate history table for each base table and want to use the same table to hold old and "deleted" records then you have to mark the records with a status flag. The problem with that it's a real pain when querying for current records - trust me I've done that.

DJ
Thanks for the answer! That was the way that I had envisioned it at first, but since I have over 100 tables to revision, I wanted to avoid also having 100 revision tables and 100 audit tables. This is why I was trying to share the audit table across all my tables and also avoid duplicating each table. Maybe my goal is unreasonable, but this is why I'm exploring and researching on this. Again, thanks!
Jeach
+1 for the history table per base table. There is often a desire to have one Audit table. This can lead to serious performance issues if the system is fairly busy. The audit table becomes a bottleneck for each transaction. Quite simply if transaction #2 is waiting for trans #1 to complete your user will experience a slow down
Karl
You only need one Audit table - but you still need a History table for each of your main tables
DJ
In my case the audit table is just the date/time and user info
DJ
Great point Karl... that's the kind of info I was looking for. I was planning on using transactions more extensively which may lead to bottleneck problems as you have mentioned. I will have to take that into consideration. Possibly even run benchmarks to assess my real-world use. I'm willing to accept more schema complexity but I'm not willing to sacrifice once ounce of performance. Thanks!
Jeach
if the person is deleted. will the record deleted from person table and person_history table? what should be used as indication in audit table for deleted record?
cometta
Yes the record is deleted from the person table but is moved to the history table. I suppose there could be an indicator on the history table to indicate it's a deleted record.
DJ
A: 

Following DJ's post in using a history table per base table and a comment by Karl about possible performance issues, I've done a bit of SQL research in order to figure out the fastest possible way to transfer a record from one table to another.

I just wanted to document what I found:

I thought that I would have to do an SQL fetch to load the record from the base table, followed with an SQL push to put the record into the history table, followed by an update to the base table to insert the changed data. Total of 3 transactions.

But to my surprise I realized that you can do the first two transactions using one SQL statement using the SELECT INTO syntax. I'm betting performance would be a hundred fold faster doing this.

Then that would leave us to simply UPDATE the record with the new data within the base table.

I still haven't found one SQL statement to do all 3 transactions at once (I doubt I will).

Jeach
You won't find a single statement to do all at once. You can simulate three at once with a transaction or a savepoint.
Jonathan Leffler
+3  A: 

How about you create the table as normal, have a ModifiedDate column on each record (and ModifiedBy if you like), and do all your data access through a materialized view which groups the data by Id and then does a HAVING ModifiedDate = MAX(ModifiedDate)?

This way, adding a new record with the same Id as another will remove the old record from the view. If you want to query history, don't go through the view

I've always found maintaining different tables with the same columns to be complex and error prone.

mcintyre321
A: 

I like your audit table, its a good start. You've got a cardinality issue with your audit table, so I would bust it out as two tables:

Person
------------------------------------------------
ID                UINT NOT NULL,
PersonID          UINT NOT NULL,
Name              VARCHAR(200) NOT NULL,
DOB               DATE NOT NULL,
Email             VARCHAR(100) NOT NULL,
AuditID           UINT NOT NULL 

Audit
------------------------------------------------
ID                UINT NOT NULL,
TableName         VARCHAR(50) NOT NULL,        -- What
TableKey          UINT NOT NULL,
CreateDate        DATETIME NOT NULL  DEFAULT(NOW),
CreateUserID      UINT NOT NULL,
ChangeDate        DATETIME NOT NULL  DEFAULT(NOW),
ChangeUserID      UINT NOT NULL

Audit_Item
------------------------------------------------
ID                UINT NOT NULL,
AuditID           UINT NOT NULL,               -- Which audit record
UserID            UINT NOT NULL,               -- Who
OldRecID          UINT NOT NULL,               -- Where
NewRecID          UINT NOT NULL,
AffectedOn        DATE NOT NULL,               -- When
Comment           VARCHAR(500) NOT NULL        -- Why

The initial layout proposed has a single Audit record that points back to (I assume) two Person records. The challenges of this design are:

  • Which records in your person table are the current 'real' records?
  • How do you represent the entire history of changes to the Person record? If you are pointing to two records in the Person table, then see point #1: which one is the current record?
  • The Create*, Change* fields are rolled up from a collection of Audit_Item records. They are only there for ease of access.
  • The AuditID key in the Person table allows you to point back to the Audit table and get to the history of the individual Person without needing to query the Audit table with the clause WHERE TableName='Person'
Jeff Fritz
I've been staring at your 'cardinality' issue for a while and I just can't seem to understand your intended benefits? Isn't UserID, CreateUserID and ChangeUserID duplicates of the same information? Isn't AffectedOn, CreateDate and ChangeDate also duplicates? Alos, isn't this adding another SQL transaction every time a person gets edited? Can you edit your post to explain your proposed enhancement... thanks!
Jeach