views: 190
answers: 6

What algorithms and processes are involved in storing revision changes the way Stack Overflow and Wikipedia do?

Is only one copy of the message kept, and if so, is it the latest copy, with only the changes needed to get back to the previous version(s) stored from there? (That would make displaying the main message faster.) Or are complete messages stored, with the comparison done between them on each display?

What algorithms are best for determining the exact changes between versions of a message? How is this data stored in a database?

If anyone knows exactly what Wikipedia or Stack Overflow does, I'd love to know.

+1  A: 

Usually messages are stored as complete snapshots. Previous versions are disabled (hidden from normal display), and the most recent is shown. There may be optimizations, such as caching which version is the most recent.

John Millikin
+1  A: 

The longest common subsequence algorithm, which underlies most text diffs, can be used to detect differences between versions, but it is limited. For example, it does not detect moved text as such; it sees a move as unrelated removals and insertions.
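
For illustration, here is a minimal line-based sketch of such a diff in Python (the function name and output format are just this answer's invention, not anything a real site uses):

    def lcs_diff(old_lines, new_lines):
        """Diff two lists of lines via a longest-common-subsequence table."""
        m, n = len(old_lines), len(new_lines)
        # lcs[i][j] = length of the LCS of old_lines[:i] and new_lines[:j]
        lcs = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m):
            for j in range(n):
                if old_lines[i] == new_lines[j]:
                    lcs[i + 1][j + 1] = lcs[i][j] + 1
                else:
                    lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
        # Walk back through the table to recover keep/insert/delete operations.
        ops, i, j = [], m, n
        while i > 0 or j > 0:
            if i > 0 and j > 0 and old_lines[i - 1] == new_lines[j - 1]:
                ops.append(("keep", old_lines[i - 1]))
                i -= 1
                j -= 1
            elif j > 0 and (i == 0 or lcs[i][j - 1] >= lcs[i - 1][j]):
                ops.append(("insert", new_lines[j - 1]))
                j -= 1
            else:
                ops.append(("delete", old_lines[i - 1]))
                i -= 1
        return list(reversed(ops))

A moved paragraph comes out of this as a delete plus an insert, which is exactly the limitation described above.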

I suppose that websites normally store the latest copy in full, and apply reverse diffs from there. This is also the way CVS works, but Subversion uses forward diffs, which results in slower checkouts.

To store this in a database, one could maintain a main table with the latest versions, and have a separate table with the reverse differences. This table would have rows in the format (article_id, revision_id, differences).
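
A sketch of that layout in SQLite (the table and column names here are only this answer's suggestion):

    import sqlite3

    conn = sqlite3.connect("revisions.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS articles (
        article_id   INTEGER PRIMARY KEY,
        current_text TEXT NOT NULL   -- latest version, stored in full
    );
    CREATE TABLE IF NOT EXISTS revisions (
        article_id  INTEGER NOT NULL REFERENCES articles(article_id),
        revision_id INTEGER NOT NULL,
        differences TEXT NOT NULL,   -- reverse diff back to the previous version
        PRIMARY KEY (article_id, revision_id)
    );
    """)
    conn.commit()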

Thomas
A: 

Typical revision changes are stored using a delta algorithm, so the only data stored are the changes each revision makes relative to the original. I am unsure how Wikipedia or Stack Overflow have implemented it.

mattlant
A: 

I would use the following technique:

  • Store the current message as complete text.
  • Store the history using the delta algorithm.

This keeps performance good for the regular display while keeping the storage needed for the history to a minimum.
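
A rough sketch of that technique using Python's standard difflib (the storage layer is left out; the history is just a list here):

    import difflib

    def make_reverse_delta(new_text, old_text):
        """Delta that reconstructs the old version from the new one."""
        return list(difflib.ndiff(new_text.splitlines(keepends=True),
                                  old_text.splitlines(keepends=True)))

    def save_edit(history, current_text, new_text):
        # Keep the new message as the full current text; push a reverse
        # delta onto the history so old versions stay reconstructible.
        history.append(make_reverse_delta(new_text, current_text))
        return new_text

    def previous_version(history):
        # restore(delta, 2) yields the second sequence fed to ndiff,
        # i.e. the older text.
        return "".join(difflib.restore(history[-1], 2))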

Davy Landman
+4  A: 

MediaWiki (the software behind Wikipedia) stores the full text of every revision; see the database schema. Each entry in MediaWiki's text table has flags that tell whether the content has been e.g. gzipped; using a standard compression is often the sanest option.
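
Roughly what that looks like in Python (a simplification: MediaWiki's real flags field is a comma-separated list carrying more options than just compression):

    import gzip

    def pack_revision(text):
        # Compress the revision body and record a flag saying so.
        return gzip.compress(text.encode("utf-8")), "gzip"

    def unpack_revision(blob, flags):
        if "gzip" in flags.split(","):
            blob = gzip.decompress(blob)
        return blob.decode("utf-8")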

I can't tell you how to do the diffs algorithmically, but whatever algorithm you use, you should run it on two full versions of the text. That is, fetch the complete old and new versions from the database, then do the diff. This makes it easy to change the diffing algorithm later.
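
For instance, with Python's difflib (a stand-in for whichever algorithm you pick; fetch_text is a hypothetical database accessor):

    import difflib

    def revision_diff(fetch_text, article_id, old_rev, new_rev):
        # Fetch both complete versions first; the diffing algorithm only
        # ever sees full texts, so it stays easy to swap out.
        old = fetch_text(article_id, old_rev).splitlines()
        new = fetch_text(article_id, new_rev).splitlines()
        return "\n".join(difflib.unified_diff(
            old, new,
            fromfile="rev %d" % old_rev, tofile="rev %d" % new_rev,
            lineterm="",
        ))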

Git is a great example of a Unix application that does very cheap (storage- and speed-wise) delta storage. There are wikis that can use git, e.g. ikiwiki, but I'm guessing you want to do it with a database.

Erik Johansson
Wish I could accept 2 answers
Brian R. Bondy
A: 

The accepted answer is pretty bad. Problems:

  • slow
  • not future-proof
  • complicated
Erik Johansson
Even if you store a complete copy of each message, this shows how to do the diff when displaying the "differences" page.
Brian R. Bondy