I'd like to have the full history of a large text field edited by users, stored using Django.
I've seen the projects:
I've a special use-case that probably falls outside the scope of what these projects provide. Further, I'm wary of how well documented, tested and updated these projects are. In any event, here's the problem I face:
I've a model, likeso:
from django.db import models
class Document(models.Model):
text_field = models.TextField()
This text field may be large - over 40k - and I would like to have an autosave feature that saves the field every 30 seconds or so. This could make the database unwieldly large, obviously, if there are a lot of saves at 40k each (probably still 10k if zipped). The best solution I can think of is to keep a difference between the most recent saved version and the new version.
However, I'm concerned about race conditions involving parallel updates. There are two distinct race conditions that come to mind (the second much more serious than the first):
HTTP transaction race condition: User A and User B request document X0, and make changes individually, producing Xa and Xb. Xa is saved, the difference between X0 and Xa being "Xa-0" ("a less not"), Xa now being stored as the official version in the database. If Xb subsequently saves, it overwrite Xa, the diff being Xb-a ("b less a").
While not ideal, I'm not overly concerned with this behaviour. The documents are overwriting each other, and users A and B may have been unaware of each other (each having started with document X0), but the history retains integrity.
Database read/update race condition: The problematic race condition is when Xa and Xb save at the same time over X0. There will be (pseudo-)code something like:
def save_history(orig_doc, new_doc): text_field_diff = diff(orig_doc.text_field, new_doc.text_field) save_diff(text_field_diff)
If Xa and Xb both read X0 from the database (i.e. orig_doc is X0), their differences will become Xa-0 and Xb-0 (as opposed to the serialized Xa-0 then Xb-a, or equivalently Xb-0 then Xa-b). When you try to patch the diffs together to produce the history, it will fail on either patch Xa-0 or Xb-0 (which both apply to X0). The integrity of the history has been compromised (or has it?).
One possible solution is an automatic reconciliation algorithm, that detects these problems ex-post. If rebuilding the history fails, one may assume that a race condition has occurred, and so apply the failed patch to prior versions of the history until it succeeds.
I'd be delighted to have some feedback and suggestions on how to go about tackling this problem.
Incidentally, insofar as it's a useful way out, I've noted that Django atomicity is discussed here:
- Django: How can I protect against concurrent modification of data base entries, and here:
- Atomic operations in Django?
Thank you kindly.