views: 190
answers: 6

What algorithms and processes are involved in storing revision changes the way Stack Overflow and Wikipedia do?

Is only one copy of the message kept, and if so, is it the latest copy, with only the changes needed to get back to the previous version(s) stored from there? (That would make displaying the main message faster.) Or are complete messages stored, with the comparison done between them on each display?

What algorithms are best for determining the exact changes between versions of a message? How is this data stored in a database?

If anyone knows exactly what Wikipedia or Stack Overflow does, I'd love to know.

+1  A: 

Usually messages are stored as complete snapshots. Previous versions are disabled (hidden from normal display), and the most recent is shown. There may be optimizations, such as caching which version is the most recent.

John Millikin
+1  A: 

The longest common subsequence algorithm, which underlies most text diffs, can be used to detect differences between versions, but it is limited. For example, it does not detect moved text as such; it sees a move as unrelated removals and insertions.
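
For illustration, here is a minimal line-based sketch of such a diff in Python (the function name and output format are just this answer's invention, not anything a real site uses):

    def lcs_diff(old_lines, new_lines):
        """Diff two lists of lines via a longest-common-subsequence table."""
        m, n = len(old_lines), len(new_lines)
        # lcs[i][j] = length of the LCS of old_lines[:i] and new_lines[:j]
        lcs = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                if old_lines[i] == new_lines[j]:
                    lcs[i + 1][j + 1] = lcs[i][j] + 1
                else:
                    lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
        # Walk back through the table to recover keep/insert/delete operations.
        ops, i, j = [], m, n
        while i > 0 or j > 0:
            if i > 0 and j > 0 and old_lines[i - 1] == new_lines[j - 1]:
                ops.append(("keep", old_lines[i - 1]))
                i -= 1
                j -= 1
            elif j > 0 and (i == 0 or lcs[i][j - 1] >= lcs[i - 1][j]):
                ops.append(("insert", new_lines[j - 1]))
                j -= 1
            else:
                ops.append(("delete", old_lines[i - 1]))
                i -= 1
        return list(reversed(ops))

A moved paragraph comes out of this as a delete plus an insert, which is exactly the limitation described above.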

I suppose that websites normally store the latest copy in full, and apply reverse diffs from there. This is also the way CVS works, but Subversion uses forward diffs, which results in slower checkouts.

To store this in a database, one could maintain a main table with the latest versions, and have a separate table with the reverse differences. This table would have rows in the format (article_id, revision_id, differences).
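
A sketch of that layout in SQLite (the table and column names here are only this answer's suggestion):

    import sqlite3

    conn = sqlite3.connect("revisions.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS articles (
        article_id   INTEGER PRIMARY KEY,
        current_text TEXT NOT NULL   -- latest version, stored in full
    );
    CREATE TABLE IF NOT EXISTS revisions (
        article_id  INTEGER NOT NULL REFERENCES articles(article_id),
        revision_id INTEGER NOT NULL,
        differences TEXT NOT NULL,   -- reverse diff back to the previous version
        PRIMARY KEY (article_id, revision_id)
    );
    """)
    conn.commit()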

Thomas
A: 

Typical revision changes are stored using a delta algorithm, so the only data stored are the changes each revision makes relative to the original. I am unsure how Wikipedia or Stack Overflow have implemented it.

mattlant
A: 

I would use the following technique:

  • Store the current message as complete text.
  • Store the history using the delta algorithm.

This keeps performance good for the regular display while keeping the storage needed for the history to a minimum.
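
A rough sketch of that technique using Python's standard difflib (the storage layer is left out; the history is just a list here):

    import difflib

    def make_reverse_delta(new_text, old_text):
        """Delta that reconstructs the old version from the new one."""
        return list(difflib.ndiff(new_text.splitlines(keepends=True),
                                  old_text.splitlines(keepends=True)))

    def save_edit(history, current_text, new_text):
        # Keep the new message as the full current text; push a reverse
        # delta onto the history so old versions stay reconstructible.
        history.append(make_reverse_delta(new_text, current_text))
        return new_text

    def previous_version(history):
        # restore(delta, 2) yields the second sequence fed to ndiff,
        # i.e. the older text.
        return "".join(difflib.restore(history[-1], 2))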

Davy Landman
+4  A: 

MediaWiki (the software behind Wikipedia) stores the full text of every revision; see the database schema. Each entry in MediaWiki's text table has flags that tell whether the content has been e.g. gzipped; using a standard compression is often the sanest option.
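
Roughly what that looks like in Python (a simplification: MediaWiki's real flags field is a comma-separated list carrying more options than just compression):

    import gzip

    def pack_revision(text):
        # Compress the revision body and record a flag saying so.
        return gzip.compress(text.encode("utf-8")), "gzip"

    def unpack_revision(blob, flags):
        if "gzip" in flags.split(","):
            blob = gzip.decompress(blob)
        return blob.decode("utf-8")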

I can't tell you how to do the diffs algorithmically, but whatever algorithm you use, you should run it on two full versions of the text. That is, fetch the complete old and new versions from the database, then do the diff. This makes it easy to change the diffing algorithm later.
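
For instance, with Python's difflib (a stand-in for whichever algorithm you pick; fetch_text is a hypothetical database accessor):

    import difflib

    def revision_diff(fetch_text, article_id, old_rev, new_rev):
        # Fetch both complete versions first; the diffing algorithm only
        # ever sees full texts, so it stays easy to swap out.
        old = fetch_text(article_id, old_rev).splitlines()
        new = fetch_text(article_id, new_rev).splitlines()
        return "\n".join(difflib.unified_diff(
            old, new,
            fromfile="rev %d" % old_rev, tofile="rev %d" % new_rev,
            lineterm="",
        ))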

Git is a great example of a Unix application that does very cheap (storage- and speed-wise) delta storage. There are wikis that can use git, e.g. ikiwiki, but I'm guessing you want to do it with a database.

Erik Johansson
Wish I could accept 2 answers
Brian R. Bondy
A: 

The accepted answer is pretty bad. Problems:

  • slow
  • not future-proof
  • complicated
Erik Johansson
Even if you store a complete copy of each message, this shows how to do the diff when displaying the "differences" page.
Brian R. Bondy