Let's say that I have a record in the database, and that both admin and normal users can update it.

Can anyone suggest a good approach/architecture for version-controlling every change in this table, so it's possible to roll back a record to a previous revision?

Thanks /Niels

A: 

Usually you version-control the SQL scripts that modify the database schema, but you do not version-control the values stored in the database.

Extracting any value from a previous state of your database is the job of your backup mechanism, through daily incremental backups or weekly full backups.

The Coding Horror post mentioned by Yuval F is not a good answer, as it suggests version control could be used for data as well as schema.

As said in the comments of that post:

As others have pointed out, the analogy of "source control" to databases breaks down in several instances: branches, rollbacks, test cases, patching, and so on.

Code is stateless and data is, well, state; because data is intermingled with schema, you sometimes have to treat the schema as immutable.

That's not to say that effective database versioning isn't a positive goal, but it's a lofty one and it's important not to fool ourselves into believing that we've attained it. What you get with existing tools is a very limited kind of version control: the ability to create a new database from scratch that's compatible with any given version. That's all. It's great if you do a lot of new rollouts, but only marginally useful if you work on a long-term production system.

As long as it's understood that having your database scripts in SCM isn't going to afford the same protections as having your code in there, and that you should still be very conservative with schema changes, I think it's probably a good thing to do if you have more than 3 or 4 developers. If you're a pair or a one-man show, it's probably not worth the effort, because the frequency of changes should be rare.

Since I support well over 50 projects, each with many associated databases, I can assure you that for large projects, version control tools are simply not made for versioning the data in a database.
That kind of data evolves at a very different pace from the rest of your development.

SCM is well suited to storing schema upgrade (or even downgrade) scripts.
Data is managed with backups.

VonC
I don't think this question is about source control of schema design, I think it's about changes made by the application to database rows.
WW
"changes made by the application to database rows" is about data, is it not? Hence my answer: SCM is not made for versioning data that comes from a database.
VonC
A: 

Here's a start, in a post by one of SO's founders.

Yuval F
That is about putting the entire database under version control, not the actual data in it.
Kristoffer L
+6  A: 

I think you are looking to version the content of database records (as Stack Overflow does when someone edits a question/answer). A good starting point might be to look at a database model that uses that kind of revision tracking.

The best example that comes to my mind is MediaWiki (the Wikipedia engine); look at the database diagram here and check how the revision table is related.

Depending on what technologies you're using, you'll have to find some good diff/merge algorithms.

Check this question if it's for .NET.

CMS
+4  A: 

In the BI world, you could accomplish this by adding a startDate and endDate to the table you want to version. When you insert the first record into the table, the startDate is populated, but the endDate is null. When you insert the second record, you also update the endDate of the first record with the startDate of the second record.

When you want to view the current record, you select the one where endDate is null.

This is sometimes called a type 2 Slowly Changing Dimension. See also TupleVersioning
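The type 2 approach described above can be sketched as follows. This is a minimal illustration using SQLite via Python's `sqlite3`; the table and column names are invented for the example, not taken from the answer.

```python
import sqlite3

# Type 2 slowly changing dimension: each version of a record carries a
# start_date and end_date; a NULL end_date marks the current version.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_price (
        product_id INTEGER,
        price      REAL,
        start_date TEXT,
        end_date   TEXT            -- NULL = current record
    )
""")

def set_price(product_id, price, today):
    # Close out the current version, if any...
    conn.execute(
        "UPDATE product_price SET end_date = ? "
        "WHERE product_id = ? AND end_date IS NULL",
        (today, product_id))
    # ...then insert the new version with an open end_date.
    conn.execute(
        "INSERT INTO product_price VALUES (?, ?, ?, NULL)",
        (product_id, price, today))

set_price(1, 9.99, "2009-01-01")
set_price(1, 12.50, "2009-02-01")

# The current record is the one whose end_date is NULL.
current = conn.execute(
    "SELECT price FROM product_price "
    "WHERE product_id = 1 AND end_date IS NULL").fetchone()
print(current[0])  # prints 12.5
```

Rolling back to a previous revision is then a matter of selecting the row whose date range covers the moment you want, and re-opening it as the current version.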

ranomore
Won't my table grow quite large using this approach?
Niels Bosma
Yes, but you can deal with that by indexing and/or partitioning the table. Also, there will only be a small handful of large tables. Most will be much smaller.
ConcernedOfTunbridgeWells
+25  A: 

Let's say you have a FOO table that admins and users can update. Most of the time you can write queries against the FOO table. Happy days.

Then, I would create a FOO_HISTORY table. This has all the columns of the FOO table. The primary key is the same as FOO's, plus a RevisionNumber column. There is a foreign key from FOO_HISTORY to FOO. You might also add columns related to the revision, such as the UserId and RevisionDate. Populate the RevisionNumbers in an ever-increasing fashion across all the *_HISTORY tables (e.g. from an Oracle sequence or equivalent). Do not rely on there being only one change in a second; i.e. do not put RevisionDate into the primary key.

Now, every time you update FOO, just before you do the update you insert the old values into FOO_HISTORY. You do this at some fundamental level in your design so that programmers can't accidentally miss this step.

If you want to delete a row from FOO you have some choices. Either cascade and delete all the history, or perform a logical delete by flagging FOO as deleted.

This solution is good when you are largely interested in the current values and only occasionally in the history. If you always need the history, then you can add effective start and end dates and keep all the records in FOO itself. Every query then needs to check those dates.
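As a concrete sketch of the FOO/FOO_HISTORY pattern, here is the "copy old values before updating" step done at the application level, using SQLite through Python's `sqlite3`. The global counter stands in for the Oracle sequence mentioned above; the column values are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (
        id    INTEGER PRIMARY KEY,
        value TEXT
    );
    CREATE TABLE foo_history (
        id              INTEGER REFERENCES foo(id),
        value           TEXT,
        revision_number INTEGER,
        revision_date   TEXT,
        user_id         TEXT,
        PRIMARY KEY (id, revision_number)
    );
""")

revision_seq = 0  # stand-in for an Oracle sequence or equivalent

def update_foo(foo_id, new_value, user_id, when):
    global revision_seq
    revision_seq += 1
    # Copy the current values into the history table first...
    conn.execute(
        "INSERT INTO foo_history (id, value, revision_number, revision_date, user_id) "
        "SELECT id, value, ?, ?, ? FROM foo WHERE id = ?",
        (revision_seq, when, user_id, foo_id))
    # ...then apply the update to the main table.
    conn.execute("UPDATE foo SET value = ? WHERE id = ?",
                 (new_value, foo_id))

conn.execute("INSERT INTO foo VALUES (1, 'original')")
update_foo(1, "edited by admin", "admin", "2009-03-01")
update_foo(1, "edited by user", "niels", "2009-03-02")

print(conn.execute("SELECT value FROM foo WHERE id = 1").fetchone()[0])
# prints "edited by user"; foo_history holds "original" and "edited by admin"
```

In practice you would enforce this in triggers or in the data access layer, as the comments below suggest, rather than trusting every caller to use the helper function.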

WW
You can do the audit table updating with database triggers if your data access layer doesn't directly support it. Also, it's not hard to build a code generator to make the triggers that uses introspection from the system data dictionary.
ConcernedOfTunbridgeWells
I would recommend that you actually insert the _new_ data, not the previous, so the history table has all of the data. Although it stores redundant data, it eliminates the special cases required to search both tables when historical data is needed.
Nerdfest
@Nerdfest - I meant insert all the current values into the history table, and then update the main table with new values.
WW
Personally I'd recommend not deleting anything (defer this to a specific housekeeping activity) and have an "action type" column to specify whether it is insert/update/delete. For a delete you copy the row as normal, but put "delete" in the action type column.
Neil Barnwell
A: 

I don't completely agree with the comment cited by VonC that data is (only) state as opposed to stateless code. Some data is actually stateless information, such as "magic values" that link code with data.

Unfortunately the question is not clear on whether it's arbitrary data that is to be version-controlled, or these magic values.

I wrote an application to create data scripts (T-SQL for database contents, C# for numeric identifiers) which can be included in a VCS.

devio
I agree with you, but trying to manage those magic values in another repository (the VCS) and keep them in sync with their native repository (the database) is too much of a hurdle when you have lots of projects and databases. Just my 2 cents, though. In a simpler context, that might be possible.
VonC
+1  A: 

You don't say which database, and I don't see it in the post tags. If it's Oracle, I can recommend the approach built into Designer: use journal tables. If it's any other database, well, I basically recommend the same approach too...

The way it works, in case you want to replicate it in another DB or just want to understand it: for each table a shadow table is created too, a normal database table with the same field specs, plus some extra fields: what action was last taken (a string, with typical values "INS" for insert, "UPD" for update and "DEL" for delete), a datetime for when the action took place, and a user id for who did it.

Through triggers, every action on any row in the table inserts a new row into the journal table with the new values, the action taken, when, and by which user. You don't ever delete any rows (at least not for the last few months). Yes, it will grow big, easily to millions of rows, but you can easily track the value of any record at any point in time since journaling started (or since old journal rows were last purged), and who made the last change.

In Oracle, everything you need is generated automatically as SQL code; all you have to do is compile/run it, and it comes with a basic CRUD application (actually only the "R") to inspect the journal.
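The journal-table idea above can be sketched outside Oracle as well. This is a hand-rolled illustration using SQLite via Python's `sqlite3` (Designer would generate trigger code for you); the table name, the `jn_` column names, and the point-in-time query are my own invention for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item_jn (      -- shadow table: same fields plus audit columns
        id        INTEGER,
        name      TEXT,
        jn_action TEXT,         -- 'INS', 'UPD' or 'DEL'
        jn_when   TEXT,
        jn_user   TEXT
    );
""")

def journal(action, row, when, user):
    # Record the new values plus what/when/who in the journal table.
    conn.execute("INSERT INTO item_jn VALUES (?, ?, ?, ?, ?)",
                 (row[0], row[1], action, when, user))

# Application-level journaling; in the real scheme triggers do this for you.
conn.execute("INSERT INTO item VALUES (1, 'draft')")
journal("INS", (1, "draft"), "2009-01-01", "niels")
conn.execute("UPDATE item SET name = 'final' WHERE id = 1")
journal("UPD", (1, "final"), "2009-02-01", "admin")

# Value of row 1 as of 2009-01-15: the latest journal entry on or before it.
row = conn.execute("""
    SELECT name FROM item_jn
    WHERE id = 1 AND jn_when <= '2009-01-15'
    ORDER BY jn_when DESC LIMIT 1
""").fetchone()
print(row[0])  # prints "draft"
```

The final query shows the payoff the answer describes: reconstructing the value of any record at any point in time from the journal alone.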

bart
+1  A: 

Two options:

  1. Have a history table - insert the old data into this history table whenever the original is updated.
  2. Audit table - store the before and after values - just for the modified columns in an audit table along with other information like who updated and when.
alok
+1  A: 

You can perform auditing on a SQL table via triggers. From a trigger you can access two special tables (inserted and deleted), which contain the exact rows that were inserted or deleted each time the table is modified. In the trigger SQL you can take these modified rows and insert them into the audit table. This approach makes the auditing transparent to the programmer, requiring no effort or implementation knowledge from them.

The added bonus of this approach is that the auditing occurs regardless of whether the SQL operation took place via your data access DLLs or via a manual SQL query, as the auditing is performed on the server itself.
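Here is the trigger idea in runnable form. Note one substitution: SQLite (used here through Python's `sqlite3`) exposes the changed row via `OLD`/`NEW` references rather than SQL Server's `inserted`/`deleted` pseudo-tables, but the principle, auditing on the server so every client is covered, is the same. Names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL);
    CREATE TABLE account_audit (
        id          INTEGER,
        old_balance REAL,
        new_balance REAL,
        changed_at  TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Fires on every UPDATE, no matter which client issued it.
    CREATE TRIGGER account_update_audit AFTER UPDATE ON account
    BEGIN
        INSERT INTO account_audit (id, old_balance, new_balance)
        VALUES (OLD.id, OLD.balance, NEW.balance);
    END;
""")

conn.execute("INSERT INTO account VALUES (1, 100.0)")
conn.execute("UPDATE account SET balance = 80.0 WHERE id = 1")
conn.execute("UPDATE account SET balance = 95.0 WHERE id = 1")

rows = conn.execute(
    "SELECT old_balance, new_balance FROM account_audit").fetchall()
print(rows)  # prints [(100.0, 80.0), (80.0, 95.0)]
```

Neither UPDATE statement mentioned the audit table, yet both changes were captured, which is exactly the transparency this answer is advocating.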

DoctaJonez
+1  A: 

Upgrade to SQL Server 2008.

Try its Change Tracking feature. Instead of timestamping and tombstone-column hacks, you can use this new feature to track changes to the data in your database.

MSDN SQL 2008 Change Tracking

Devtron