views:

981

answers:

5

I'd like to have the full history of a large text field edited by users, stored using Django.

I've seen the projects:

I've a special use-case that probably falls outside the scope of what these projects provide. Further, I'm wary of how well documented, tested and updated these projects are. In any event, here's the problem I face:

I've a model, likeso:

from django.db import models

class Document(models.Model):
   text_field = models.TextField()

This text field may be large - over 40k - and I would like to have an autosave feature that saves the field every 30 seconds or so. This could make the database unwieldly large, obviously, if there are a lot of saves at 40k each (probably still 10k if zipped). The best solution I can think of is to keep a difference between the most recent saved version and the new version.

However, I'm concerned about race conditions involving parallel updates. There are two distinct race conditions that come to mind (the second much more serious than the first):

  1. HTTP transaction race condition: User A and User B request document X0, and make changes individually, producing Xa and Xb. Xa is saved, the difference between X0 and Xa being "Xa-0" ("a less not"), Xa now being stored as the official version in the database. If Xb subsequently saves, it overwrite Xa, the diff being Xb-a ("b less a").

    While not ideal, I'm not overly concerned with this behaviour. The documents are overwriting each other, and users A and B may have been unaware of each other (each having started with document X0), but the history retains integrity.

  2. Database read/update race condition: The problematic race condition is when Xa and Xb save at the same time over X0. There will be (pseudo-)code something like:

     def save_history(orig_doc, new_doc):
         text_field_diff = diff(orig_doc.text_field, new_doc.text_field)
         save_diff(text_field_diff)
    

    If Xa and Xb both read X0 from the database (i.e. orig_doc is X0), their differences will become Xa-0 and Xb-0 (as opposed to the serialized Xa-0 then Xb-a, or equivalently Xb-0 then Xa-b). When you try to patch the diffs together to produce the history, it will fail on either patch Xa-0 or Xb-0 (which both apply to X0). The integrity of the history has been compromised (or has it?).

    One possible solution is an automatic reconciliation algorithm, that detects these problems ex-post. If rebuilding the history fails, one may assume that a race condition has occurred, and so apply the failed patch to prior versions of the history until it succeeds.

I'd be delighted to have some feedback and suggestions on how to go about tackling this problem.

Incidentally, insofar as it's a useful way out, I've noted that Django atomicity is discussed here:

Thank you kindly.

+1  A: 

For managing the diffs, you would probably want to investigate Python's difflib.

Regarding atomicity, I would probably handle it the same as the Wikis (Trac, etc.). If the content has changed since the user last retrieved it, request that they override with the new version. If you're storing the text and diffs in the same record, it shouldn't be difficult to avoid database race conditions using the techniques in the links you posted.

Jeff Bauer
Difflib is great, thank you. I still haven't worked out the atomicity, but I think it's doable.
Brian M. Hunt
+2  A: 

The storage issue: I think you should only store the diffs of two consecutive valid versions of the document. As you point out, the problem becomes getting a valid version when concurrent edits take place.

The concurrency issue:

  1. Could you avoid them all together like Jeff suggests or by locking the document?
  2. If not, I think you're ultimately in the paradigm of online collaborative real-time editors like Google Docs.

To get an illustrated view of the can of worms you are opening catch this google tech-talk at 9m21s (it's about Eclipse's collaborative real-time editing)

Alternatively, there are a couple of patents that detail ways of dealing with these concurrences on the Wikipedia article on collaborative real-time editors.

Ivan
Very helpful links, thank you. Very interesting problem. I'm perhaps looking for middle-ground: concurrent editing without the complexity of collaborative real-time editing.
Brian M. Hunt
+1  A: 

Your auto save, I assume, saves a draft version before the user actually presses the save button, right?

If so, you don't have to keep the draft saves, simply dispose them after the user decideds to save for real, and only keep history of the real/explicit saves.

hasen j
Good suggestion. I like the idea of keeping an implicit history - so you can go back and go "oh, right". It comes at a price, though. :)
Brian M. Hunt
+2  A: 

Here's what I've done to save an object's history:

For Django application History:

history/__init.py:**

"""
history/__init__.py
"""
from django.core import serializers
from django.utils import simplejson as json
from django.db.models.signals import pre_save, post_save

# from http://code.google.com/p/google-diff-match-patch/
from contrib.diff_match_patch import diff_match_patch

from history.models import History

def register_history(M):
  """
  Register Django model M for keeping its history

  e.g. register_history(Document) - every time Document is saved,
  its history (i.e. the differences) is saved.
  """
  pre_save.connect(_pre_handler, sender=M)
  post_save.connect(_post_handler, sender=M)

def _pre_handler(signal, sender, instance, **kwargs):
  """
  Save objects that have been changed.
  """
  if not instance.pk:
    return

  # there must be a before, if there's a pk, since
  # this is before the saving of this object.
  before = sender.objects.get(pk=instance.pk)

  _save_history(instance, _serialize(before).get('fields'))

def _post_handler(signal, sender, instance, created, **kwargs):
  """
  Save objects that are being created (otherwise we wouldn't have a pk!)
  """
  if not created:
     return

  _save_history(instance, {})

def _serialize(instance):
   """
   Given a Django model instance, return it as serialized data
   """
   return serializers.serialize("python", [instance])[0]

def _save_history(instance, before):
  """
  Save two serialized objects
  """
  after = _serialize(instance).get('fields',{})

  # All fields.
  fields = set.union(set(before.keys()),set(after.keys()))

  dmp = diff_match_patch()

  diff = {}

  for field in fields:
    field_before = before.get(field,False)
    field_after = after.get(field,False)

    if field_before != field_after:
      if isinstance(field_before, unicode) or isinstance(field_before, str):
      # a patch
        diff[field] = dmp.diff_main(field_before,field_after)
      else:
        diff[field] = field_before

  history = History(history_for=instance, diff=json.dumps(diff))
  history.save()

history/models.py

"""
history/models.py
"""

from django.db import models

from django.contrib.contenttypes.models import ContentType
from django.contrib.contenttypes import generic

from contrib import diff_match_patch as diff

class History(models.Model):
     """
     Retain the history of generic objects, e.g. documents, people, etc..
  """

  content_type = models.ForeignKey(ContentType, null=True)

  object_id = models.PositiveIntegerField(null=True)

  history_for = generic.GenericForeignKey('content_type', 'object_id')

  diff = models.TextField()

  def __unicode__(self):
       return "<History (%s:%d):%d>" % (self.content_type, self. object_id, self.pk)

Hope that helps someone, and comments would be appreciated.

Note that this does not address the race condition of my greatest concern. If, in _pre_handler "before = sender.objects.get(pk=instance.pk)" is called before another instance saves, but after that other instance has updated the history, and the present instance saves first, there will be an 'broken history' (i.e. out-of-order). Thankfully diff_match_patch attempts to gracefully handle "non-fatal" breaks, but there's no guarantee of success.

One solution is atomicity. I'm not sure how to go about making the above race condition (i.e. _pre_handler) an atomic operation across all instances of Django, though. A HistoryLock table, or a shared hash in memory (memcached?) would be fine - suggestions?

The other solution, as mentioned, is a reconciliation algorithm. However, concurrent saves may have "genuine" conflicts and require user intervention to determine the correct reconciliation.

Obviously, piecing the history back together isn't part of the above snippets.

Brian M. Hunt
+1  A: 

I've since discovered django-reversion, also, which seems to work well and be actively maintained, though it doesn't do diff's to efficiently store small diffs to large pieces of text.

Brian M. Hunt