For example, I have a table which stores details about properties, which could have owners, a value, etc.

Is there a good design for keeping the history of every change to owner and value? I want to do this for many tables, kind of like an audit of the table.

What I thought was keeping a single table with fields

table_name, field_name, prev_value, current_val, time, user.

But it looks kind of hacky and ugly. Is there a better design?

Thanks.

+2  A: 

In our projects we usually do it this way: you have a table

properties(ID, value1, value2)

then you add table

properties_audit(ID, RecordID, timestamp or datetime, value1, value2)

ID is the id of the history record (not really required).

RecordID points to the record in the original properties table.

When you update the properties table, you add a new record to properties_audit with the previous values of the record being updated. This can be done using triggers or in your DAL, as sketched below.

After that you have the latest value in properties and all the history (previous values) in properties_audit.
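A minimal trigger sketch in Oracle syntax, assuming the time column is named changed_at and the optional ID is omitted (or auto-filled by a sequence):

create or replace trigger properties_audit_trg
  before update on properties
  for each row
begin
  -- copy the pre-update values into the audit table
  insert into properties_audit (RecordID, changed_at, value1, value2)
  values (:old.ID, sysdate, :old.value1, :old.value2);
end;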

Kirill Muzykov
+3  A: 

I think a simpler schema would be

table_name, field_name, value, time, userId

There is no need to save both current and previous values in the audit table. When you make a change to any of the fields, you just add a row to the audit table with the changed value. This way you can always sort the audit table by time and know what value the field held prior to your change. For example:
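A hypothetical query sketch, assuming the audit table is called field_audit and (as you would need in practice) also carries the id of the audited row:

select field_name, value, time, userId
  from field_audit                 -- assumed table name
 where table_name = 'properties'
   and record_id  = 42             -- hypothetical column identifying the row
 order by time desc;               -- latest change first; previous values follow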

Faisal Feroz
You're right, it seems unnecessary. I just typed it in without thinking.
saint
Just to emphasize a few bad points of this schema (which may or may not apply): using info from this table for attribute-rich tables is generally a pain in the neck. If you want to display your data, you will need to cast the values to the appropriate types, then collect all the attributes and basically turn them back into a structure that resembles the original record (this depends on the usage, but is probably true as a rule). If you don't consider that a problem (say, because you automate the client), then neither are the verbose per-table audit tables, since you can automate their maintenance too (creating the tables and triggers with a script).
Unreason
+3  A: 

There are a few approaches:

Field based

audit_field (table_name, id, field_name, field_value, datetime)

This one can capture the history of all tables and is easy to extend to new tables; no changes to the structure are necessary for a new table.

field_value is sometimes split into multiple fields to natively support the actual field type from the original table (only one of those fields will be filled per row, so the data is denormalized; a variant is to split the above table into one table per type).

Other metadata such as field_type, user_id, user_ip, action (update, delete, insert), etc. can be useful.

The structure of such records will most likely need to be transformed before it can be used.
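One possible concrete DDL for this table (names and types are illustrative, Oracle-flavored):

create table audit_field (
  table_name   varchar2(30)    not null,   -- source table
  id           number          not null,   -- key of the audited row
  field_name   varchar2(30)    not null,
  field_value  varchar2(4000),             -- everything stored as text here
  datetime     date            not null
);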

Record based

audit_table_name (timestamp, id, field_1, field_2, ..., field_n)

For each record type in the database, create a generalized table that has all the same fields as the original record, plus a versioning field (additional metadata is again possible). One audit table for each working table is necessary; the process of creating such tables can be automated.

This approach provides a semantically rich structure very similar to the main data structure, so the tools used to analyze and process the original data can easily be used on this structure too.
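As a sketch, the shadow table for the earlier properties(ID, value1, value2) example might look like this (column types assumed; the timestamp is renamed to avoid clashing with the type name):

create table audit_properties (
  audit_ts   date           not null,  -- when this version was captured
  id         number         not null,  -- key of the row in properties
  value1     varchar2(100),            -- same columns as the live table
  value2     number
);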

Log file

The first two approaches usually use tables which are very lightly indexed (or not indexed at all, with no referential integrity) so that the write penalty is minimized. Still, a flat log file is sometimes preferred, though of course functionality is greatly reduced. (It basically depends on whether you want an actual audit/log that will be analyzed by some other system, or whether the historical records are part of the main system.)

Unreason
This seems complete. I wonder what the performance would be like for each approach. So with record-based, there will be a mirror table (in terms of fields) with a version field?
saint
@saint, performance is a composite; it depends on the usage patterns. A log file will be the fastest to write (especially on a separate partition or a separate node), and you will certainly be writing a lot. Now, how important is reading? Reading once a week is a different thing from inspecting timeline correlations between various fields or doing some other data mining (as an extreme example).
Unreason
@saint, yes, with record-based you extend the tables with a versioning field (and possibly other interesting metadata such as the user).
Unreason
With record-based you have to be much more careful in querying to make sure you have the current record; many errors result from a programmer forgetting to do this. If you use this approach, make sure you create a view of the active records and have the developers use that unless they are specifically looking for history.
HLGEM
@HLGEM, just to clarify: the 'record-based' approach does not specify that the full history table will replace the working table (nor do I think it should). Your active records should still be in a working table; active records usually have a completely different usage pattern, and the application will benefit from this partitioning, especially on a busy database with a lot of corrections. So I would vote against the view as a rule of thumb.
Unreason
+3  A: 

A different way to look at this is to time-dimension the data.

Assuming your table looks like this:

create table my_table (
my_table_id      number        not null primary key,
attr1            varchar2(10)  not null,
attr2            number            null,
constraint my_table_ak unique (attr1, attr2) );

Then if you changed it like so:

create table my_table (
my_table_id      number        not null,
attr1            varchar2(10)  not null,
attr2            number            null,
effective_date   date          not null,
is_deleted       number(1,0)   default 0 not null,
constraint my_table_ak unique (attr1, attr2, effective_date),
constraint my_table_pk primary key (my_table_id, effective_date) );

You'd be able to have a complete running history of my_table, online and available. You'd have to change the paradigm of the programs (or use database triggers) to turn UPDATE activity into INSERT activity, and to change DELETE activity into an UPDATE of the IS_DELETED flag.
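One way to do that interception, sketched with a view and an INSTEAD OF trigger (names are hypothetical, Oracle syntax; note sysdate has one-second granularity, so this is a sketch, not production code):

create or replace view my_table_v as
  select m.my_table_id, m.attr1, m.attr2
    from my_table m
   where m.is_deleted = 0
     and m.effective_date = (select max(h.effective_date)
                               from my_table h
                              where h.my_table_id = m.my_table_id);

create or replace trigger my_table_v_upd
  instead of update on my_table_v
  for each row
begin
  -- record the new state as a fresh effective-dated row;
  -- the old row simply stops being the latest version
  insert into my_table (my_table_id, attr1, attr2, effective_date, is_deleted)
  values (:old.my_table_id, :new.attr1, :new.attr2, sysdate, 0);
end;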


Unreason:

You are correct that this solution is similar to record-based auditing; I initially read it as a concatenation of fields into a string, which I've also seen. My apologies.

The primary differences I see between time-dimensioning the table and using record-based auditing center around maintainability, without sacrificing performance or scalability.

Maintainability: one needs to remember to change the shadow table when making a structural change to the primary table. Similarly, one needs to remember to change the triggers which perform change-tracking, as such logic cannot live in the app. If one uses a view to simplify access to the tables, you've also got to update it, and change the instead-of trigger against it which intercepts DML.

In a time-dimensioned table, you make the structural change you need to, and you're done. As someone who's been the FNG on a legacy project, I can say such clarity is appreciated, especially if you have to do a lot of refactoring.

Performance and Scalability: if one partitions the time-dimensioned table on the effective/expiry date column, the active records are in one "table" and the inactive records are in another. Exactly how is that less scalable than your solution? "Deleting" an active record involves row movement in Oracle, which is a delete-and-insert under the covers, exactly what the record-based solution would require.

The flip side of performance is that if the application is querying for a record as of some date, partition elimination allows the database to search only the table/index where the record could be; a view-based solution to search active and inactive records would require a UNION-ALL, and not using such a view requires putting the UNION-ALL in everywhere, or using some sort of "look-here, then look-there" logic in the app, to which I say: blech.
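For contrast, a sketch of the "look in both places" view that a split active/history design would need (table names are hypothetical):

create or replace view my_table_all as
  select my_table_id, attr1, attr2, effective_date from my_table_active
  union all
  select my_table_id, attr1, attr2, effective_date from my_table_history;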

In short, it's a design choice; I'm not sure either's right or either's wrong.

Adam Musch
Use a view to get only the active ones. Another issue with this approach is that the tables can get very large very fast; you will have to be good at performance tuning.
HLGEM
However, if the history needs to be *easily* accessible, this is a lot easier than unscrambling the field-based or record-based auditing options and eliminates the nasty complexity of remembering what tables you've got shadows of, as refactoring of one requires refactoring of the other. (Partitioning an `expiry_date` instead, using NULL to indicate a current record, eliminates the need for a view.)
Adam Musch
This looks clean, but I'm not sure I understand the use of the unique effective date?
saint
@Adam, the advantages over field-based are clear, but I would not compare field-based and record-based in terms of unscrambling/decoding; there is not a lot of difference between your approach and record-based (in fact I consider your solution a variation of the record-based approach, where you avoided partitioning the tables). This does make the solution less scalable; the impact depends on the ratio of active vs. historical records, so it might be negligible. Also, it does _not_ really make things semantically clearer. So you hurt performance and clarity, while flexibility is almost the same.
Unreason
@Adam, otherwise +1; it is not a bad solution and it is concisely explained.
Unreason
@saint: Adding effective_date or expiry_date to the unique constraint allows one to show the versions of the record; otherwise there could be only one record with ID = 4, instead of ID = 4 as of Jan 1, as of Feb 10, ...
Adam Musch
Although I like this approach, I hate the fact that the primary key is now composite, which can cause quite a few problems with referential integrity. Nevertheless, +1.
Hosam Aly