How do you design a database that supports a feature allowing the application user to create a snapshot of their data at a point in time, a bit like version control?

It would give the user the ability to go back and see what their data looked like in the past.

Assume that the data being "snapshotted" is complex and includes joins of multiple tables.

I'm looking for a way to give each application user the ability to snapshot their data and go back to it. Whole-database snapshots are not what I'm looking for.

EDIT: Thanks for your answers. The 6NF answer is compelling, as is the suggestion to de-normalise the snapshot data because of its simplicity.

Clarification: this is not a data warehousing question, nor is it a question about DB backup and restore; it's about how to build a schema that lets us capture the state of a specific set of related data at a point in time. The snapshots are generated by the application users whenever they see fit. Users do not snapshot the entire DB, just the data object they are interested in.

A: 

SQL Server 2005 (onwards) Enterprise Edition has the ability to create database snapshots.
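
For example, creating a snapshot is a single DDL statement (the database name, logical file name, and path here are made up):

CREATE DATABASE AppDB_Snapshot_20090601 ON
    ( NAME = AppDB_Data,  -- logical name of the source database's data file
      FILENAME = 'C:\Snapshots\AppDB_Snapshot_20090601.ss' )
AS SNAPSHOT OF AppDB;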

Mitch Wheat
A: 

Oracle has had Flashback technology since version 9i; it is much improved in 10g and 11g. It lets you see the state of your database at any given point in history, provided you enable flashback.

Check this document: Flashback Overview
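
For instance, a Flashback Query reads a table as of an earlier time (table name and timestamp are made up):

SELECT *
FROM orders AS OF TIMESTAMP
    TO_TIMESTAMP('2009-06-01 12:00:00', 'YYYY-MM-DD HH24:MI:SS');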

zendar
A: 

You can use the logs produced by your RDBMS to obtain snapshots of your data. Normally the logs are used for database recovery, but they can also be used to replicate data across several RDBMS instances or to obtain snapshots of the data.

To get a snapshot of your data, take all the logs produced before the desired moment in time, then replay them to rebuild the database with your data restored to that point.

How to access and replay the logs depends on the specific RDBMS product you use.
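
In SQL Server, for instance, replaying the logs takes the form of a point-in-time restore (backup file names and the timestamp are made up):

RESTORE DATABASE AppDB FROM DISK = 'C:\Backups\AppDB_full.bak' WITH NORECOVERY;
RESTORE LOG AppDB FROM DISK = 'C:\Backups\AppDB_log.trn'
    WITH STOPAT = '2009-06-01 12:00:00', RECOVERY;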

Another possibility is to use a temporal database. Temporal databases have time aspects built in and allow going back in time. Look at "Oracle Flashback Technology", for example: http://www.stanford.edu/dept/itss/docs/oracle/10g/appdev.101/b10795/adfns_fl.htm#ADFNS1008

lewap
+3  A: 

We did this once by creating separate database tables that contained the data we wanted to snapshot, but denormalized, i.e. every record contained all the data required to make sense, not references to ids that may no longer exist. We also added a date to each row.

Then we wrote triggers for the relevant inserts and updates that joined all affected tables and inserted the result into the snapshot tables.

This way it would be trivial to write something that restored the users' data to a point in time.

If you have a table:

user:

id, firstname, lastname, department_id

department:

id, name, departmenthead_id

your snapshot of the user table could look like this:

user_id, user_firstname, user_lastname, department_id, department_name, departmenthead_id, departmenthead_firstname, departmenthead_lastname, snapshot_date

and a query something like

INSERT INTO usersnapshot
SELECT user.id AS user_id, user.firstname AS user_firstname, user.lastname AS user_lastname,
       department.id AS department_id, department.name AS department_name,
       departmenthead.id AS departmenthead_id, departmenthead.firstname AS departmenthead_firstname, departmenthead.lastname AS departmenthead_lastname,
       GETDATE() AS snapshot_date
FROM user
INNER JOIN department ON user.department_id = department.id
INNER JOIN user departmenthead ON department.departmenthead_id = departmenthead.id  -- self-join: the department head is also a user

This ensures each row in the snapshot is true for that moment in time, even if department or department head has changed in the meantime.
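
Restoring (or just viewing) a user's data as of a point in time is then a single-table query; something like this, where @user_id and @point_in_time are made-up parameters:

SELECT *
FROM usersnapshot
WHERE user_id = @user_id
  AND snapshot_date = (SELECT MAX(snapshot_date)
                       FROM usersnapshot
                       WHERE user_id = @user_id
                         AND snapshot_date <= @point_in_time);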

Kamiel Wanrooij
+4  A: 

This is NOT easy.

You're essentially asking for a Temporal Database (What Christopher Date calls Sixth Normal Form, or 6NF).

To be 6NF, a schema must also be 5NF, and, basically, for each datum you attach a time range for which that value is applicable. Then, in joins, you must include only the rows that are within the time range being considered.

Temporal modeling is hard -- it's what 6th Normal Form addresses -- and not well supported in current RDBMSes.

The problem is the granularity. 6th Normal Form (as I understand it) supports temporal modeling by making every non-key attribute (non-key, i.e., anything "on" the entity that can change without the entity losing its identity) a separate relation. To this, you add a timestamp or time range or version number. Making everything a join solves the granularity problem, but it also means your queries will be more complicated and slower. It also requires figuring out all keys and non-key attributes, which tends to be a large effort.

Basically, everywhere you have a relation ("ted owns the GM stock certificate with id 789") you add a time: "ted owns the GM stock certificate with id 789 now" so that you can simultaneously say, "fred owned the GM stock certificate with id 789 from 3 Feb 2000 to yesterday". Obviously these relations are many-to-many (ted can own more than one certificate now, and more than one over his lifetime too, and fred can have previously owned the certificate jack owns now).

So we have a table of owners, and a table of stock certificates, and a many-to-many table that relates owners and certificates by id. To the many-to-many table, we add a start_date and an end_date.
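
A minimal sketch of that many-to-many table (names and types are illustrative, not from any particular product):

CREATE TABLE person_stockcertificate (
    person_id      INT  NOT NULL REFERENCES person(id),
    certificate_id INT  NOT NULL REFERENCES stockcertificate(id),
    start_date     DATE NOT NULL,
    end_date       DATE NULL,  -- NULL means the ownership is current
    PRIMARY KEY (person_id, certificate_id, start_date)
);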

Now, imagine that each state/province/land taxes the dividends on stock certificates, so for tax purposes we need to record the stock certificate owner's state of residency.

Where the owner resides can obviously change independently of stock ownership; ted can live in Nebraska, buy 10 shares, get a dividend that Nebraska taxes, move to Nevada, sell 5 shares to fred, and buy 10 more shares.

But for us, it's: ted can move to Nebraska at some time, buy 10 shares at some time, get a dividend at some time (which Nebraska taxes), move to Nevada at some time, sell 5 shares to fred at some time, and buy 10 more shares at some time.

We need all of that if we want to calculate what taxes ted owes in Nebraska and in Nevada, joining on the matching/overlapping date ranges in person_stockcertificate and person_address. A person's address is no longer one-to-one; it's one-to-many, because it's an address during a time range.
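
The overlap join might look like this, assuming a person_address table shaped like person_stockcertificate above (two ranges overlap when each starts before the other ends; a NULL end_date means "still current"):

SELECT pa.state, ps.certificate_id, ps.start_date, ps.end_date
FROM person_stockcertificate ps
INNER JOIN person_address pa
    ON  pa.person_id = ps.person_id
    AND pa.start_date < COALESCE(ps.end_date, CURRENT_DATE)
    AND ps.start_date < COALESCE(pa.end_date, CURRENT_DATE);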

If ted buys ten shares, do we model a buy event with a single purchase date, or do we add a date_bought to each share? Depends on the question we need the model to answer.

tpdi
A generic solution is definitely hard to accomplish, but for a specific enough use case and subset of your data, a solution like the one I described could be acceptable.
Kamiel Wanrooij
A: 

With SQL Server at least, you can use the full recovery model and keep the transaction logs between each backup set.

Then you can do a point-in-time restore.

That's a poor solution.

What exactly does your client want? Is it for analytical purposes (i.e. questions like "how many orders did we have two weeks ago?")? Because that's exactly the problem a data warehouse solves.

Ryan Barrett
+1  A: 

Having snapshots and/or an audit trail is a common database requirement. For many applications, creating 'shadow' or audit tables is an easy and straightforward task. While database-level backups and transaction logs are good to have, they are not a version control system.

Basically, you need to create a shadow table with all the same columns as the base table, and then set up triggers on the base table to place a copy of the row in the shadow table whenever it is updated or deleted.

Through some logic you can recreate what the data looked like at a given point in time. For an easy way to set this up in Sybase see: http://www.theeggeadventure.com/wikimedia/index.php/Sybase_Tips#create_.27audit.27_columns
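
In SQL Server-flavored T-SQL, such a trigger might look like this (table and column names are made up; the deleted pseudo-table holds the rows as they were before the change):

CREATE TRIGGER trg_app_user_audit ON app_user
AFTER UPDATE, DELETE
AS
INSERT INTO app_user_shadow (id, firstname, lastname, department_id, audit_date)
SELECT id, firstname, lastname, department_id, GETDATE()
FROM deleted;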

If you need to do lots of historical snapshots, you can instead keep the history in the same table. Basically, create two columns: an 'added' and a 'deleted' column. The downside is that every query must add a WHERE clause. Of course you can create a view that shows just the active records. This gets a bit more complicated if you have a normalized database with multiple tables, all with history.

However, it does work. You simply have the 'added' and 'deleted' columns on each table, and your queries filter on the point in time of interest. Whenever data is modified, you copy the current row and mark the old one as deleted.
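
The point-in-time filter then looks something like this (@point_in_time is a made-up parameter), with a view hiding it for the common "active rows only" case:

SELECT *
FROM app_user
WHERE added <= @point_in_time
  AND (deleted IS NULL OR deleted > @point_in_time);

CREATE VIEW active_app_user AS
SELECT * FROM app_user WHERE deleted IS NULL;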

brianegge
A: 

I have a question, not an answer! How would I go about developing a database for the following: an HTML form has 3 text fields - username, date, and comment.

The username is a unique id. How would I go about attaching each comment to the user based on the date, so that in the future I can query the comments made to that user on that particular date?

Thank you in advance!

You should create a new question for this
Jared Russell