views:

121

answers:

2

I have a requirement to store all versions of an entity in a easily indexed way and was wondering if anyone has input on what system to use.

Without versioning the system is simply a relational database with a row per, for example, person. If the person's state changes that row is changed to reflect this. With versioning the entry should be updated in such a way so that we can always go back to a previous version. If I could use a temporal database this would be free and I would be able to ask 'what is the state of all people as of yesterday at 2pm living in Dublin and aged 30'. Unfortunately there doesn't seem to be any mature open source projects that can do temporal.

A really nasty way to do this is just to insert a new row per state change. This leads to duplication, as a person can have many fields but only one changing per update. It is also then quite slow to select the correct version for every person given a timestamp.

In theory it should be possible to use a relational database and a version control system to mimic a temporal database but this sounds pretty horrendous.

So I was wondering if anyone has come across something similar before and how they approached it?

Update As suggested by Aaron here's the query we currently use (in mysql). It's definitely slow on our table with >200k rows. (id = table key, person_id = id per person, duplicated if the person has many revisions)

select name from person p where p.id = (select max(id) from person where person_id = p.person_id and timestamp <= :timestamp)

Update It looks like the best way to do this is with a temporal db but given that there aren't any open source ones out there the next best method is to store a new row per update. The only problem is duplication of unchanged columns and a slow query.

+2  A: 

There are two ways to tackle this. Both assume that you always insert new rows. In every case, you must insert a timestamp (created) which tells you when a row was "modified".

The first approach uses a number to count how many instances you already have. The primary key is the object key plus the version number. The problem with this approach seems to be that you'll need a select max(version) to make a modification. In practice, this is rarely an issue since for all updates from the app, you must first load the current version of the person, modify it (and increment the version) and then insert the new row. So the real problem is that this design makes it hard to run updates in the database (for example, assign a property to many users).

The next approach uses links in the database. Instead of a composite key, you give each object a new key and you have a replacedBy field which contains the key of the next version. This approach makes it simple to find the current version (... where replacedBy is NULL). Updates are a problem, though, since you must insert a new row and update an existing one.

To solve this, you can add a back pointer (previousVersion). This way, you can insert the new rows and then use the back pointer to update the previous version.

Aaron Digulla
These are both good ways to do versioning, and in fact we currently do what you suggest in your first answer, but as I mentioned in the question this is quite slow. The query to, for example, get everyone's name as of yesterday (not the most up to date version) would involve selecting max(version) where created <= yesterday and then using those versions to look up object keys to get the name. From experience this is quite slow.
Dave
+1 - for large amounts of heavily versioned data, if you want to use an RDBMS, performance will probably be a challenge. You'll just have to tune it like crazy, use horizontal partitioning, and count on ninja DBA tricks to meet your performance goals.
Jeff Sternal
@Dave: I suggest you post a full example of your query, the tables involved, plus the name/version of the database so we can find out if we can improve something.
Aaron Digulla
A: 

Here is a (somewhat dated) survey of the literature on temporal databases: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.6988&amp;rep=rep1&amp;type=pdf

I would recommend spending a good while sitting down with those references and/or Google Scholar to try to find some good techniques that fit your data model. Good luck!

bdonlan