Introduction

I have the following code, which checks whether a similar model already exists in the database and, if it does not, creates the new model:

from django.core.exceptions import ValidationError
from django.db import models
from django.db.models import Q

class BookProfile(models.Model):

    # ...

    def save(self, *args, **kwargs):

        uniqueConstraint = {'book_instance': self.book_instance, 'collection': self.collection}

        # Test for other objects with identical values
        profiles = BookProfile.objects.filter(Q(**uniqueConstraint) & ~Q(pk=self.pk))

        # If none are found create the object, else fail.
        if len(profiles) == 0:
            super(BookProfile, self).save(*args, **kwargs)
        else:
            raise ValidationError('A Book Profile for that book instance in that collection already exists')

I first build my constraints, then search for any model with those values, which I am enforcing must be unique (Q(**uniqueConstraint)). In addition, I ensure that if the save method is updating rather than inserting, the object being saved is excluded from the search for similar objects (~Q(pk=self.pk)).

I should mention that I am implementing soft delete (with a modified objects manager which only shows non-deleted objects), which is why I must check for duplicates myself rather than relying on unique_together errors.
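For reference, the manager looks something like the following sketch (the deleted field and the class name here are illustrative, not my exact code):

# managers.py (illustrative sketch only)

from django.db import models

class SoftDeleteManager(models.Manager):

    def get_queryset(self):
        # Hide rows that have been soft deleted
        # (older Django versions name this method get_query_set)
        return super(SoftDeleteManager, self).get_queryset().filter(deleted=False)

# and on the model:
#     deleted = models.BooleanField(default=False)
#     objects = SoftDeleteManager()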

Problem

Right, that's the introduction out of the way. My problem is that when multiple identical objects are saved in quick (or near-simultaneous) succession, sometimes both get added, even though the first one being added should prevent the second.

I have tested the code in the shell and it succeeds every time I run it. Thus my assumption is that if, say, we have two objects being added, Object A and Object B: Object A runs its check upon save() being called. Then the process saving Object B gets some time on the processor. Object B runs that same test, but Object A has not yet been added, so Object B is added to the database. Then Object A regains control of the processor; having already run its test, it adds itself regardless, even though the identical Object B is now in the database.

My Thoughts

The reason I fear multiprogramming could be involved is that Object A and Object B are each being added through an API save view, so a request to the view is made for each save, rather than a single request performing multiple sequential saves.

It might be the case that Apache is creating a process for each request, and thus causing the problems I think I am seeing. As you would expect, the problem only occurs sometimes, which is characteristic of multiprogramming or multiprocessing errors.

If this is the case, is there a way to make the test and set parts of the save() method a critical section, so that a process switch cannot happen between the test and the set?

A: 

You need to use synchronization on the save method. I haven't tried this yet, but here's a decorator that can be used to do so.
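Roughly, such a decorator might look like this (an illustrative sketch only, and it can only serialize calls made within a single process):

import threading
from functools import wraps

def synchronized(func):
    # One lock per decorated function, created when the decorator is applied
    lock = threading.Lock()

    @wraps(func)
    def wrapper(*args, **kwargs):
        with lock:
            return func(*args, **kwargs)
    return wrapper

# Usage (illustrative):
#     class BookProfile(models.Model):
#         @synchronized
#         def save(self, *args, **kwargs):
#             ...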

Luiz C.
Will that work as the save method is not actually a threaded object?
Marcus Whybrow
I have tried it, and it just seems to break the save method.
Marcus Whybrow
+1  A: 

Based on what you've described, it seems reasonable to assume that multiple Apache processes are a source of problems. Are you able to replicate if you limit Apache to a single worker process?

Maybe the suggestions in this thread will help: http://stackoverflow.com/questions/1123200/how-to-lock-a-critical-section-in-django

An alternative approach could be utilizing a queue. You'd just stick your objects to be saved into the queue and have another process do the actual saving. That way you could guarantee that objects were processed sequentially. This wouldn't work well if your application depends on having the object saved by the time the response is returned, unless you also had the request processes wait on the result (watching a finished queue, for example).
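In rough outline (the names here are illustrative; the sketch also assumes the request handlers share a parent process with the worker, whereas separate Apache processes would need an external broker such as a database table or a dedicated queue service):

import multiprocessing

save_queue = multiprocessing.Queue()

def save_worker(queue):
    # A single worker drains the queue, so saves happen strictly one at a time
    while True:
        obj = queue.get()
        if obj is None:   # sentinel value: shut the worker down
            break
        obj.save()

worker = multiprocessing.Process(target=save_worker, args=(save_queue,))
worker.daemon = True
worker.start()

# A request handler would then enqueue rather than saving directly:
#     save_queue.put(book_profile)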

Update: You may find this info useful. Mr. Dumpleton does a much better job of laying out the considerations than I could attempt to summarize here:

http://code.google.com/p/modwsgi/wiki/ProcessesAndThreading

http://code.google.com/p/modwsgi/wiki/ConfigurationGuidelines, especially the "Defining Process Groups" section.

http://code.google.com/p/modwsgi/wiki/QuickConfigurationGuide, particularly the "Delegation To Daemon Process" section.

http://code.google.com/p/modwsgi/wiki/IntegrationWithDjango. Find the section of text toward the bottom of the page that begins with:

Now, traditional wisdom in respect of Django has been that it should preferably only be used on single threaded servers. This would mean for Apache using the single threaded 'prefork' MPM on UNIX systems and avoiding the multithreaded 'worker' MPM.

and read until the end of the page.

Brian Luft
That solution seems to be more for writing to a file using Django. I think my solution probably does the job.
Marcus Whybrow
Yes, it is a solution using file locking so that multiple processes can synchronize based on the availability of the file lock. It really has little to do with "writing to a file" other than using a write operation to create the file initially. Your solution should work as long as your web server only ever uses a single process to serve requests. If you ever change that configuration your solution won't handle the cases where requests being handled in different processes execute your code.
Brian Luft
Is that Apache's default configuration, and does that mean there is only ever a single process which handles all requests, or that there is one process for each request?
Marcus Whybrow
I've updated my original answer with links that should help you understand the various configurations under which both Apache and mod_wsgi would utilize multiple processes and/or threads.
Brian Luft
Hmmm, it is important to read the parts following that section starting 'Now, traditional wisdom in respect of Django ...' as that was only a lead-in, and reading it in isolation may give the wrong idea. Also, prefork and embedded mode can be a problem in itself; read 'http://blog.dscpl.com.au/2009/03/load-spikes-and-excessive-memory-usage.html'. Otherwise a good set of references for further reading.
Graham Dumpleton
@Graham Poor grammar on my part, that's what I meant to say. Edited. Thanks!
Brian Luft
Thanks for all the resources mentioned; I have read a few already and some I have not. I have implemented the file lock approach suggested by the author of the mentioned StackOverflow question, and I have edited my suggestion to reflect this change.
Marcus Whybrow
+1  A: 

I have found a solution that I think might work:

import threading

def save(self, *args, **kwargs):

    lock = threading.Lock()
    lock.acquire()
    try:
        pass  # Test and Set Code
    finally:
        lock.release()

It doesn't seem to break the save method like that decorator did, and thus far I have not seen the error again.

Unless anyone can say that this is not a correct solution, I think this works.

Update

The accepted answer was the inspiration for this change.

It seems I was under the impression that locks were some sort of special voodoo exempt from normal logic. Here lock = threading.Lock() is run each time, thus instantiating a new, unlocked lock which can always be acquired.

I needed a single central lock for the purpose, but where could that go unless I had a thread running all the time holding the lock? The answer seemed to be to use the file locks explained in this answer to the StackOverflow question mentioned in the accepted answer.
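As an aside, a lock created once at module level would at least be shared by every thread within one process, but it still would not synchronize separate Apache processes; a rough sketch of that, for comparison only:

import threading

saveLock = threading.Lock()   # created once, when the module is imported

def save(self, *args, **kwargs):
    with saveLock:            # shared by all threads in this process only
        pass  # Test and Set Code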

The following is that solution modified to suit my situation:

The Code

The following is my modified DjangoLock. I wished to keep lock files relative to the Django root; to do this I put a custom variable into the settings.py file.

# locks.py

import os
import fcntl
from django.conf import settings

class DjangoLock:

    def __init__(self, filename):
        # Lock files live inside the LOCK_DIR defined in settings.py
        self.filename = os.path.join(settings.LOCK_DIR, filename)
        self.handle = open(self.filename, 'w')

    def acquire(self):
        # Blocks until an exclusive (advisory) lock on the file is obtained
        fcntl.flock(self.handle, fcntl.LOCK_EX)

    def release(self):
        fcntl.flock(self.handle, fcntl.LOCK_UN)

    def __del__(self):
        self.handle.close()

And now the additional LOCK_DIR settings variable:

# settings.py

import os
PATH = os.path.abspath(os.path.dirname(__file__))

# ...

LOCK_DIR = os.path.join(PATH, 'locks')

That will now put lock files in a folder named locks relative to the root of the Django project. Just make sure you give Apache write access; in my case:

sudo chown www-data locks

And finally the usage is much the same as before:

import locks

def save(self, *args, **kwargs):

    lock = locks.DjangoLock('ClassName')
    lock.acquire()
    try:
        # Test and Set Code
    finally:
        lock.release()

This is now the implementation I am using and it seems to be working really well. Thanks to all who have contributed to the process of arriving at this end.
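As a possible refinement, which is not part of the implementation above, DjangoLock could be given __enter__ and __exit__ methods so the acquire/release pair is handled by a with statement:

    # Illustrative additions to the DjangoLock class above:

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.release()

# which would allow:
#     with locks.DjangoLock('ClassName'):
#         pass  # Test and Set Code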

Marcus Whybrow
That seems to be correct. Here's a place that explains the limitations of using locks: http://effbot.org/zone/thread-synchronization.htm
Luiz C.
Assuming that your web server is configured to handle all Django requests within a single process.
Brian Luft
Yes, http://effbot.org/zone/thread-synchronization.htm is where I found my solution, and I use the WSGIScriptAlias directive in the Apache config. Would using WSGI ensure each Django request is handled in a single process?
Marcus Whybrow
WSGI is just a protocol spec, so I assume you mean mod_wsgi. To guarantee a single process with mod_wsgi you need to use daemon mode, which means also using the WSGIDaemonProcess directive and NOT having a processes= argument. In other words, just specify threads=<some number>.
Van Gale
Yes, you are correct about mod_wsgi. I am thinking about adding the following directives: `WSGIDaemonProcess example.com threads=15` and `WSGIProcessGroup example.com` (obviously replaced with my domain). Would this do the trick?
Marcus Whybrow
Yes, that should do it, and as a preemptive cautionary note, don't try and get tricky and use "processes=1" unless you understand the implications... for reasons documented on the wiki.
Van Gale
Yes, I read that in the wiki: mentioning `processes=` at all makes it assume something you don't want, even with a single process. Thanks for the help.
Marcus Whybrow