views:

30

answers:

3

I am putting together a staff database and I need to be able to revise the staff member information, but also keep track of all the revisions. How should I structure the database so that I can have multiple revisions of the same user data but be able to query against the most recent revision? I am looking at information that changes rarely, like Last Name, but that I will need to be able to query for out of date values. So if Jenny Smith changes her name to Jenny James I need to be able to find the user's current information when I search against her old name.

I assume that I will need at least 2 tables, one that contains the uid and another that contains the revisions. Then I would join them and query against the most recent revision. But should I break it out even further, depending on how often the data changes or the type of data? I am looking at about 40 fields per record and only one or two fields will probably change per update. Also I cannot remove any data from the database, I need to be able to look back on all previous records.

+1  A: 

A simple way of doing this is to add a deleted flag and instead of updating records you set the deleted flag on the existing record and insert a new record.

You can of course also write the existing record to an archive table, if you prefer. But if changes are infrequent and the table is not big I would not bother.

To get the active record, query with 'where deleted = 0', the speed impact will be minimal when there is an index on this field.

Typically this is augmented with some other fields like a revision number, when the record was last updated, and who updated it. The revision number is very useful to get the previous versions and also to do optimistic locking. The 'who updated this last and when' questions usually come once the system is running instead of during requirements gathering, and are useful fields to put in any table containing 'master' data.

Peter Tillemans
How would an index on the `deleted` column help? If the query engine uses this index, wouldn't it have to do a table-scan for all the other columns? Wouldn't it be better to index the field you're searching for and just discard the deleted rows from the search results?
Jeffrey L Whitledge
No, it would use an index scan on the deleted column and then search in the other columns. You can have multiple indexes and the query optimizer will use statistics collected by the database to estimate the relative performance of each index and put them in the order so that the index expected to give the least results is used first. Try EXPLAIN PLAN on some queries to see how this works. The optimizers are usually so smart that they know it is not worth using the index if only 10% of records are actually deleted.
Peter Tillemans
So let me restate. I would have two tables one with the staff members unchanging id and one with the the mutable staff details. That Staff details table would have an indexed field (deleted) that I would set when a new record is inserted for that staff member. When I want to query the most recent version I just limit to where deleted = 0. did I miss anything?
Tyson of the Northwest
nope. that covers it.
Peter Tillemans
+2  A: 

I would use the separate table because then you can have a unique identifier that points to all the other child records that is also the PK of the table which I think makes it less likely you will have data integrity issues. For instance, you have Mary Jones who has records in the address table and the email table and performance evaluation table, etc. If you add a change record to the main table, how are you going to relink all the existing information? With a separate history table, it isn't a problem.

With a deleted field in one table, you then have to have an non-autogenerated person id and an autogenrated recordid.

You also have the possiblity of people forgetting to use the where deleted = 0 where clause that is needed for almost every query. (If you do use the deleted flag field, do yourself a favor and set a view with the where deleted = 0 and require developers to use the view in queries not the orginal table.)

With the deleted flag field you will also need a trigger to ensure one and only one record is marked as active.

HLGEM
+1 You said what I was feeling. :-)
Jeffrey L Whitledge
I looks like I'll have at least two tables, one that contains the immutable id for the staff member and one with the mutable staff member information. That way I never have to delete a record, since each record will have the immutable id set as a foreign key. I am wondering if I should break up the records beyond two tables into tables where the fields are grouped by abstract type or likely hood uf being updated.
Tyson of the Northwest
A: 

@Peter Tillemans' suggestion is a common way to accomplish what you're asking for. But I don't like it.

The structure of a database should reflect the real-world facts that are being modeled.

I would create a separate table for obsolete_employee, and just store the historical information that would need to be searched in the future. This way you can keep your real employee data table clean and keep only the old data that is necessary. This approach will also simplify reporting and other features of the application that are not related to searching historical data.

Just think of that warm feeling you'll get when you type select * from employee and nothing but current, correct goodness comes flowing back!

Jeffrey L Whitledge