views:

228

answers:

3

I will soon be beginning work on a project that (from the spec) reminds me a bit of StackOverflow. Basically, its a web app that has user-controlled content.

One of the features that's got me going around in circles in my mind is the version control. Here on StackOverflow, each question and answer can have multiple revisions. This is pretty simple to implement when you have only one type of object (and, in this case, its text).

So, for my simple pages, I'm set.

The problem comes in when I consider that some objects that need to be under version control have relationships. To provide a concrete example, let me choose a random analagous domain:

Lets say I was implementing a Wiki-like site for keeping track of book/author info. The primary focus of the site would be to create and update "Author" pages, which, as text, is pretty simple (as above). However, let us add a one-to-many association between authors and books (in other words, books would be separate objects, as obviously a person could author many books). Each book would have a link from the Author page to an informational page about that book.

To the user, there is little difference between the text-based "summary" describing the author and the links between that author & their works. Thus, we have a requirement to implement the "revision"/edit feature for author pages, book pages, and the association between authors and books. In other words, the user should be able to edit, view history of, and rollback author pages, book pages, and associations between the two.

This becomes even more complicated when that relationship becomes a many-to-many, where multiple authors could be listed as having contributed to a book.

I have a number of solutions in mind, but none of them are as clean as I'd like (and involve at least some repeated code/redundant data storage), and, although I do see commonality all over the place here, I feel that I haven't really been able to extract it best, especially at the database level. I don't want to bias the answers given so I'm not going to give them right away.

So, how would you design this system at the database level? I'm looking for table specifications here, and possibly a description of how you'd use them, if its not immediately obvious. For those answers to which it may be relevant, I'm going to be using ASP.NET and either Linq-to-SQL (I'm comfortable with many-to-many in LTS) or Entity Framework.

EDIT: To clarify, I understand basic DB design, normalization, many-to-many mapping tables, etc. I'm looking for a clean solution to this specific situation.

EDIT 2: I'm looking for a generalizable solution, as there may be way more sub-objects in the system than just books. The author may be related to other authors, magazines, events, etc, etc, etc. I feel like I'm repeating a lot of work if I implement history individually for each one.

A: 

I have a table for each table: i.e., Author and Book.

There's the usual foreign key relationship (whatever that is) between the tables.

Each table also has a history table, i.e. AuthorHistory and BookHistory. These history tables contain the old/obsolete versions of records (e.g. each deleted and/or edited Author record). There's no foreign key relationship to/from the history tables.


Edit:

Some functionality is similar for each table: for example, no matter which table, updating a record means storing the old copy of the record in te corresponding history table. I implement this functionality using database triggers (update and delete triggers for each table); because the database engine that I'm using supports triggers, and that makes it transparent to the application. The code within these triggers is similar from one table to the next (only the names of the table, and the list of field names, differs from one table to the next).


What about the many-to-many situation? This is more difficult because you could actually have no record mapping an author to a book, but have previously had one and need to show that as a history item

Edit#2:

I haven't implemented history of a many-to-many situation yet, but I don't see why it wouldn't be the same, i.e.:

  • The many-to-many relationship is implemented by there being a BookAuthor table, each of whose records is just BookId plus AuthorId.
  • Historical relationships are in a corresponding BookAuthorHistory table.
ChrisW
What about the many-to-many situation? This is more difficult because you could actually have no record mapping an author to a book, but have previously had one and need to show that as a history item.
JoshJordan
Indeed, you are right. Unfortunately, this is not a very general/scalable solution. It requires a new History table for each new table implemented.
JoshJordan
I don't see what's not general/scalable about that: IMO it's a "general" solution in the sense that it's a solution which works for any set of tables.
ChrisW
Its general from a design point of view, I'm looking for a general implementation.
JoshJordan
I don't know what you mean. Anyway, a different solution at the opposite end of the spectrum would be to have a single History table, which contains all historic field values from every other table in the database: http://en.wikipedia.org/wiki/Entity-attribute-value_model (I've seen it said that using EAV is a mistake which many programmers make, at some time in their careers).
ChrisW
+4  A: 

This is a common-enough problem in data warehousing. They use "slowly-changing dimensions".

There have to be some rules, however, if you're going to try and have "versioned" data.

  1. You must record the Author-Book relationship as initially defined. This the the official Author-Book relationship. It is something the data warehouse folks call a "fact-less fact table". It's pairs of keys.

  2. Books are a dimension of the book-author fact. The book can change. There are numerous slowly-changing dimension algorithms. You can keep only the latest, have history table separate from current. Keep history and current in one table with a flag to distinguish current from history.

  3. Authors are a dimension of the book-author fact. The author can change. Again, there are numerous SCD algorithms. Read on up on the choices. By Ralph Kimball's The Data Warehouse Toolkit for more information.

Note that the relationship (author to book) is a fact, and does not need versions. It's a fact. It doesn't "change". It's either true or it was put in the database in error -- in which case, it must be removed. Facts don't need version numbers.

In a more sophisticated star schema, your facts have measures. Price, volume sold, cost, profit, etc. These are also recorded in the fact table. These pieces of information may vary with time. Therefore, you almost always have a time dimension for each fact.

Therefore, Time is a dimension of the book-author fact. If this fact can change, the applicable time period is recorded as part of the fact.

The time dimension isn't quite the same thing as a version number. It's slightly simpler. It states that at a given point in time, the fact was true. If the fact changes, you append a new fact with a different timestamp.

You can, given a particular point in time, locate the relevant facts and associated dimension values.

S.Lott
Good show. Thank you. The SCD literature is helpful.
JoshJordan
Thanks. I ought to reconsider why/whether I want to have a separate history table for each table, instead of keeping each table's old data in the table itself.
ChrisW
@ChrisW: SCD Design is hard. It depends on the kinds of queries you'll get. Do folks do "counter-factual" ("what-if") queries? "What if these sales numbers were reported by last years region definition?" In this case, you might be joining against historical dimension rows. If you do this rarely, a separate history table doesn't hurt. If you do this frequently, a separate history table may be too complex.
S.Lott
One thing is that having historic data in the same table makes it harder (or less intuitive) to implement referential integrity. For example, there might be a requirement for a constraint that each Book has a corresponding Author. Nomally, you could do this with a Foreign Key. If the author table contains deleted authors, though, then a foreign key isn't a sufficient contraint (because it would allow you to insert a new book which references a deleted author).
ChrisW
Instead of a simple foreign key I'd guess you'd have to define it as a (more complicated) CHECK constraint, which should check for records with a given AuthorId but also with the required AuthorStatus value.
ChrisW
+1  A: 

Sounds almost like an ideal use case for CouchDB. With this document-oriented database you get revisions for free (each document is automatically revisioned unless you configure your database different).

It is also possible to have m:n relations between documents. However, migrating to CouchDB is quite a big step and I don't know how well it is accessible from ASP.NET. But reading some introductory tutorials can't hurt.

Franz