ansaurus

Question

Answer 1

+1 A:

For managing the diffs, you would probably want to investigate Python's difflib.
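As a rough illustration of the difflib route, here is a minimal sketch that produces a unified diff between two versions of a text. The helper name make_diff and the sample strings are made up for the example; only difflib.unified_diff comes from the standard library.

```python
import difflib

old = "The quick brown fox\njumps over the lazy dog\n"
new = "The quick brown fox\nleaps over the lazy dog\n"

def make_diff(before, after):
    """Return a unified diff (as one string) that turns `before` into `after`."""
    return "".join(difflib.unified_diff(
        before.splitlines(True),   # keep line endings so the diff joins cleanly
        after.splitlines(True),
        fromfile="before",
        tofile="after",
    ))

print(make_diff(old, new))
```

Note that difflib only generates diffs; the standard library has no function for applying a unified diff, so rebuilding an old version from stored diffs would need extra code or another library.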

Regarding atomicity, I would probably handle it the same way the wikis (Trac, etc.) do. If the content has changed since the user last retrieved it, ask them to review the new version and confirm before overwriting it. If you're storing the text and diffs in the same record, it shouldn't be difficult to avoid database race conditions using the techniques in the links you posted.
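The wiki-style check could be sketched as a version counter that every save has to match. This is an in-memory illustration only; Page, StaleEdit and loaded_version are hypothetical names, not part of Trac or Django.

```python
class StaleEdit(Exception):
    """Raised when the stored text changed after the editor loaded it."""


class Page(object):
    def __init__(self, text):
        self.text = text
        self.version = 1   # bumped on every successful save

    def save(self, new_text, loaded_version):
        # Refuse the save if someone else saved since this editor loaded the page.
        if loaded_version != self.version:
            raise StaleEdit("page is at version %d, you edited version %d"
                            % (self.version, loaded_version))
        self.text = new_text
        self.version += 1
        return self.version
```

Against a real database the comparison and the write would need to happen in one atomic statement (for example an UPDATE whose WHERE clause includes the expected version), otherwise the check itself is open to the same race.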

Jeff Bauer 2009-01-07 18:18:57

Difflib is great, thank you. I still haven't worked out the atomicity, but I think it's doable.

Brian M. Hunt 2009-01-08 16:47:14

Answer 2

+2 A:

The storage issue: I think you should store only the diffs between consecutive valid versions of the document. As you point out, the problem becomes getting a valid version when concurrent edits take place.

The concurrency issue:

Could you avoid them altogether, as Jeff suggests, or by locking the document?
If not, I think you're ultimately in the paradigm of online collaborative real-time editors like Google Docs.

To get an illustrated view of the can of worms you are opening, catch this Google Tech Talk at 9m21s (it's about Eclipse's collaborative real-time editing).

Alternatively, the Wikipedia article on collaborative real-time editors lists a couple of patents that detail ways of dealing with these concurrency conflicts.

Ivan 2009-01-07 23:24:24

Very helpful links, thank you. Very interesting problem. I'm perhaps looking for a middle ground: concurrent editing without the complexity of collaborative real-time editing.

Brian M. Hunt 2009-01-08 16:46:41

Answer 3

+1 A:

Your auto save, I assume, saves a draft version before the user actually presses the save button, right?

If so, you don't have to keep the draft saves. Simply dispose of them after the user decides to save for real, and only keep a history of the real/explicit saves.
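In other words, something along these lines: one overwritable draft slot, plus a history that only grows on explicit saves. The Document class and its attribute names are invented for this sketch and are not tied to any Django model.

```python
class Document(object):
    def __init__(self, text=""):
        self.text = text      # last explicitly saved text
        self.draft = None     # latest auto-saved draft, if any
        self.history = []     # earlier explicitly saved texts, oldest first

    def autosave(self, text):
        # Overwrite the single draft slot; drafts never enter the history.
        self.draft = text

    def save(self, text):
        # An explicit save records the previous text and discards the draft.
        self.history.append(self.text)
        self.text = text
        self.draft = None
```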

hasen j 2009-01-08 00:38:04

Good suggestion. I like the idea of keeping an implicit history - so you can go back and go "oh, right". It comes at a price, though. :)

Brian M. Hunt 2009-01-08 16:48:03

Answer 4

+2 A:

Here's what I've done to save an object's history:

For Django application History:

history/__init__.py:

"""
history/__init__.py
"""
from django.core import serializers
from django.utils import simplejson as json
from django.db.models.signals import pre_save, post_save

# from http://code.google.com/p/google-diff-match-patch/
from contrib.diff_match_patch import diff_match_patch

from history.models import History

def register_history(M):
  """
  Register Django model M for keeping its history

  e.g. register_history(Document) - every time Document is saved,
  its history (i.e. the differences) is saved.
  """
  pre_save.connect(_pre_handler, sender=M)
  post_save.connect(_post_handler, sender=M)

def _pre_handler(signal, sender, instance, **kwargs):
  """
  Save objects that have been changed.
  """
  if not instance.pk:
    return

  # there must be a before, if there's a pk, since
  # this is before the saving of this object.
  before = sender.objects.get(pk=instance.pk)

  _save_history(instance, _serialize(before).get('fields'))

def _post_handler(signal, sender, instance, created, **kwargs):
  """
  Save objects that are being created (otherwise we wouldn't have a pk!)
  """
  if not created:
    return

  _save_history(instance, {})

def _serialize(instance):
  """
  Given a Django model instance, return it as serialized data
  """
  return serializers.serialize("python", [instance])[0]

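def _save_history(instance, before):
  """
  Save the differences between `before` (the previously stored field
  values, as a dict) and the current field values of `instance`.
  """
  after = _serialize(instance).get('fields',{})

  # All field names present before or after the save.
  fields = set(before.keys()) | set(after.keys())

  dmp = diff_match_patch()

  diff = {}

  for field in fields:
    field_before = before.get(field,False)
    field_after = after.get(field,False)

    if field_before != field_after:
      if isinstance(field_before, basestring) and isinstance(field_after, basestring):
        # Text changed to text: store the diff_match_patch diff.
        diff[field] = dmp.diff_main(field_before,field_after)
      else:
        # Anything else (including a missing value): store the previous value as-is.
        diff[field] = field_before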

  history = History(history_for=instance, diff=json.dumps(diff))
  history.save()

history/models.py

"""
history/models.py
"""

from django.db import models

from django.contrib.contenttypes.models import ContentType
from django.contrib.contenttypes import generic

from contrib import diff_match_patch as diff

class History(models.Model):
  """
  Retain the history of generic objects, e.g. documents, people, etc.
  """

  content_type = models.ForeignKey(ContentType, null=True)

  object_id = models.PositiveIntegerField(null=True)

  history_for = generic.GenericForeignKey('content_type', 'object_id')

  diff = models.TextField()

  def __unicode__(self):
    return "<History (%s:%d):%d>" % (self.content_type, self.object_id, self.pk)

Hope that helps someone, and comments would be appreciated.

Note that this does not address the race condition that concerns me most. If, in _pre_handler, "before = sender.objects.get(pk=instance.pk)" runs before another instance saves but after that other instance has updated the history, and the present instance saves first, there will be a 'broken history' (i.e. out of order). Thankfully diff_match_patch attempts to gracefully handle "non-fatal" breaks, but there's no guarantee of success.

One solution is atomicity. I'm not sure how to go about making the above race condition (i.e. _pre_handler) an atomic operation across all instances of Django, though. A HistoryLock table, or a shared hash in memory (memcached?) would be fine - suggestions?
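One way to sketch the memcached idea: memcached's add only stores a key if it is absent, so it can serve as a simple cross-process lock. The names history_lock, LockTimeout and FakeCache below are made up for illustration, and FakeCache exists only so the sketch runs without a cache server; the cache argument is assumed to be anything with memcached-style add() and delete(), such as Django's cache object.

```python
import time
from contextlib import contextmanager


class LockTimeout(Exception):
    """Raised when the lock could not be acquired within `wait` seconds."""


@contextmanager
def history_lock(cache, key, timeout=10, wait=5.0, poll=0.05):
    """Hold a named lock in `cache` for the duration of the with-block."""
    deadline = time.time() + wait
    # add() returns a false value while another holder owns the key.
    while not cache.add(key, "locked", timeout):
        if time.time() >= deadline:
            raise LockTimeout(key)
        time.sleep(poll)
    try:
        yield
    finally:
        cache.delete(key)


class FakeCache(object):
    """In-memory stand-in for trying the sketch; it ignores `timeout`."""

    def __init__(self):
        self._data = {}

    def add(self, key, value, timeout=None):
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def delete(self, key):
        self._data.pop(key, None)
```

To close the race described above, the lock would have to be taken before _pre_handler reads the old row and released only after the save has completed, which the two separate signal handlers don't make easy. A holder that runs longer than the timeout also loses its protection, so treat this as a starting point rather than a guarantee.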

The other solution, as mentioned, is a reconciliation algorithm. However, concurrent saves may have "genuine" conflicts and require user intervention to determine the correct reconciliation.

Obviously, piecing the history back together isn't part of the above snippets.
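For the text fields, going back one step could look roughly like this. diff_main returns a list of (op, text) pairs where op is -1 (delete), 0 (equal) or 1 (insert), and after the JSON round trip they come back as two-item lists. The functions text_before and text_after and the sample "body" field are hypothetical; the library's own diff_text1 and diff_text2 do the same job.

```python
import json

DELETE, EQUAL, INSERT = -1, 0, 1   # op codes used by diff_match_patch


def text_before(diff):
    """Rebuild the older text: everything except the insertions."""
    return "".join(text for op, text in diff if op != INSERT)


def text_after(diff):
    """Rebuild the newer text: everything except the deletions."""
    return "".join(text for op, text in diff if op != DELETE)


# Hypothetical value of History.diff for one changed text field named "body".
stored = json.dumps({"body": [[0, "Hello "], [-1, "world"], [1, "Django"], [0, "!"]]})

body_diff = json.loads(stored)["body"]
print(text_before(body_diff))  # Hello world!
print(text_after(body_diff))   # Hello Django!
```

Walking further back means repeating this per History row, newest first. If text_before of one row doesn't match text_after of the row before it, that is a sign of the out-of-order history described above.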

Brian M. Hunt 2009-01-11 19:13:32

Answer 5

+1 A:

I've since discovered django-reversion as well, which seems to work well and be actively maintained, though it doesn't store diffs, so small changes to large pieces of text aren't stored efficiently.

Brian M. Hunt 2009-02-03 15:30:09


How to do text full history in Django?
