tags:

views:

152

answers:

4

I am trying to figure out how to setup a revisions system for posts and other content. I figured that would mean it would need to work with a basic belongs_to/has_one/has_many/has_many_though ORM (any good ORM should support this).

I was thinking a that I could have some tables like (with matching models)

[[POST]] (has_many (text) through (revisions)
id
title

[[Revisions]] (belongs_to posts/text)
id
post_id
text_id
date

[[TEXT]]
id
body
user_id

Where I could join THROUGH the revisions table to get the latest TEXT body. But I'm kind of foggy on how it will all work. Has anyone setup something like this?

Basically, I need to be able to load an article and request the latest content entry.

// Get the post row
$post = new Model_Post($id);
// Get the latest revision (JOIN through revisions to TEXT) and print that body.
$post->text->body;

Having the ability to shuffle back in time to previous revisions and removing revisions would also be a big help.

At any rate, these are just ideas of how I think that some kind of history tracking would work. I'm open to any form of tracking I just want to know what the best-practice is.

:EDIT:

It seems that moving forward, two tables seems to make the most sense. Since I plan to store two copies of text this will also help to save space. The first table posts will store the data of the current revision for fast reads without any joins. The posts body will be the value of the matching revision's text field - but processed through markdown/bbcode/tidy/etc. This will allow me to retain the original text (for the next edit) without having to store that text twice in one revision row (or having to re-parse it each time I display it).

So fetching will be be ORM friendly. Then for creates/updates I will have to handle revisions separately and then just update the post object with the new current revision values.

  CREATE TABLE IF NOT EXISTS `posts` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `published` tinyint(1) unsigned DEFAULT NULL,
  `allow_comments` tinyint(1) unsigned DEFAULT NULL,
  `user_id` int(11) NOT NULL,
  `title` varchar(100) NOT NULL,
  `body` text NOT NULL,
  `created` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `user_id` (`user_id`),
  KEY `published` (`published`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;

CREATE TABLE IF NOT EXISTS `postsrevisions` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `post_id` int(10) unsigned NOT NULL,
  `user_id` int(10) unsigned NOT NULL,
  `is_current` tinyint(1) unsigned DEFAULT NULL,
  `date` datetime NOT NULL,
  `title` varchar(100) NOT NULL,
  `text` text NOT NULL,
  `image` varchar(200) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `post_id` (`post_id`),
  KEY `user_id` (`user_id`),
  KEY `is_current` (`is_current`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
A: 

You'd probably be better off in this case to put a CurrentTextID on your Post table to avoid having to figure out which revision is current (an alternative would be a flag on Revision, but I think a CurrentTextID on the post will give you easier queries).

With the CurrentTextID on the Post, your ORM should place a single property (CurrentText) on your Post class which would allow you to access the current text with essentially the statement you provided.

Your ORM should also give you some way to load the Revisions based on the Post; If you want more details about that then you should include information about which ORM you are using and how you have it configured.

Chris Shaffer
Well, I'm not sure that would help speed things up much since I could also just `ORDER BY date DESC, LIMIT 1` when fetching the post text through the revisions table.
Xeoncross
Using LIMIT doesn't work as easily if you grab more than one post at a time (eg, if you want to show the most recent 10 posts). Anyway, if that is not what you are looking for, then what?
Chris Shaffer
I was speaking about using `$post->text->body` and having it auto-create the SQL to fetch the latest revision using `ORDER BY date` clause. At any rate, I don't really care how it's done, I just want someone who has already setup something a revisions system to tell me how they did it (i.e. tables used, model relations, etc).
Xeoncross
A: 

I think two tables would suffice here. A post table and it's revisions. If you're not worried about duplicating data, a single table (de-normalized) could also work.

digitalsanctum
Well, I'm trying to follow proper 3NF+ practice by using a through table. http://en.wikipedia.org/wiki/Database_normalizationHowever, your right that it would be easier just to use two tables - and in my case I'm not sure there would be an benefit to the "revisions" table anyway.
Xeoncross
Then you have to move data between the tables?
AnApprentice
+1  A: 

Your Revisions table as you have shown it models a many-to-many relationship between Posts and Text. This is probably not what you want, unless a given row in Text may provide the content for multiple rows in Posts. This is not how most CMS architectures work.

You certainly don't need three tables. I have no idea why you think this is needed for 3NF. The point of 3NF is that an attribute should not depend on a non-key attribute, it doesn't say you should split into multiple tables needlessly.

So you might only need a one-to-many relationship between two tables: Posts and Revisions. That is, for each post, there can be multiple revisions, but a given revision applies to only one post. Others have suggested two alternatives for finding the current post:

  • A flag column in Revisions to note the current revision. Changing the current revision is as simple as changing the flag to true in the desired revision and to false to the formerly current revision.

  • A foreign key in Posts to the revision that is current for the given post. This is even simpler, because you can change the current revision in one update instead of two. But circular foreign key references can cause problems vis-a-vis backup & restore, cascading updates, etc.

You could even implement the revision system using a single table:

CREATE TABLE PostRevisions (
  post_revision_id SERIAL PRIMARY KEY,
  post_id INT NOT NULL,
  is_current TINYINT NULL,
  date DATE,
  title VARCHAR(80) NOT NULL,
  text TEXT NOT NULL,
  UNIQUE KEY (post_id, is_current)
);

I'm not sure it's duplication to store the title with each revision, because the title could be revised as much as the text, couldn't it?

The column is_current should be either 1 or NULL. A unique constraint doesn't count NULLs, so you can have only one row where is_current is 1 and an unlimited number of rows where it's NULL.

This does require updating two rows to make a revision current, but you gain some simplicity by reducing the model to a single table. This is a great advantage when you're using an ORM.

You can create a view to simplify the common case of querying current posts:

CREATE VIEW Posts AS SELECT * FROM PostRevisions WHERE is_current = 1;

update: Re your updated question: I agree that proper relational design would encourage two tables so that you could make a few attributes of a Post invariant for all that post's revisions. But most ORM tools assume an entity exists in a single table, and ORM's are clumsy at joining rows from multiple tables to constitute a given entity. So I would say if using an ORM is a priority, you should store the posts and revisions in a single table. Sacrifice a little bit of relational correctness to support the assumptions of the ORM paradigm.

Another suggestion is to consider Dimensional Modeling. This is a school of database design to support OLAP and data warehousing. It uses denormalization judiciously, so you can usually organize data in a Star Schema. The main entity (the "Fact Table") is represented by a single table, so this would be a win for an ORM-centric application design.

Bill Karwin
I think I would rather stick with two tables. I think that the best method to handle this would be to skip the `is_current` field and just include a date with each entry in the text table so I can order by revision date. Or wait, that won't work because I will I might want to keep a revision and revert back (which needs a `is_current` key). Perhaps one table would be better since the author, title, text, and date might all change each revision. Then again, things like whether comments/pings are open might need to stay the same.
Xeoncross
Right, the attribute of `is_current` is independent from the `date`. In relational database design, it's worthwhile to give store each logical attribute in a separate column. Don't try to make columns do double-duty.
Bill Karwin
How would you work with the primary key if you were only using one table? I guess that rather than relying on the table to increase the row ID - you would have to fetch the last post ID of the latest post row...
Xeoncross
Good point. Two options: You could make `postid` the serial column, and `revision` would start from 1 for each postid and you'd increment it manually. Or you could make `revisionid` serial and `postid` a GUID to avoid race conditions.
Bill Karwin
Take a look at my updated solution above. I think that it will be ORM friendly while still allowing neat things that two tables can offer.
Xeoncross
Looks fine to me. I'd use actual foreign key constraints, and I'd name the primary keys something more descriptive than the ubiquitous `id`.
Bill Karwin
A: 

For anyone interested, here is how wordpress handles revisions using a single MySQL posts table.

CREATE TABLE IF NOT EXISTS `wp_posts` (
  `ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `post_author` bigint(20) unsigned NOT NULL DEFAULT '0',
  `post_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `post_date_gmt` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `post_content` longtext NOT NULL,
  `post_title` text NOT NULL,
  `post_excerpt` text NOT NULL,
  `post_status` varchar(20) NOT NULL DEFAULT 'publish',
  `comment_status` varchar(20) NOT NULL DEFAULT 'open',
  `ping_status` varchar(20) NOT NULL DEFAULT 'open',
  `post_password` varchar(20) NOT NULL DEFAULT '',
  `post_name` varchar(200) NOT NULL DEFAULT '',
  `to_ping` text NOT NULL,
  `pinged` text NOT NULL,
  `post_modified` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `post_modified_gmt` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `post_content_filtered` text NOT NULL,
  `post_parent` bigint(20) unsigned NOT NULL DEFAULT '0',
  `guid` varchar(255) NOT NULL DEFAULT '',
  `menu_order` int(11) NOT NULL DEFAULT '0',
  `post_type` varchar(20) NOT NULL DEFAULT 'post',
  `post_mime_type` varchar(100) NOT NULL DEFAULT '',
  `comment_count` bigint(20) NOT NULL DEFAULT '0',
  PRIMARY KEY (`ID`),
  KEY `post_name` (`post_name`),
  KEY `type_status_date` (`post_type`,`post_status`,`post_date`,`ID`),
  KEY `post_parent` (`post_parent`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 ;
Xeoncross
Wow - why in the world would they use VARCHAR(255) for a guid???
Bill Karwin