I'm trying to implement (what I think is) a pretty simple data model for a counter:

class VisitorDayTypeCounter(models.Model):
    visitType = models.CharField(max_length=60)
    visitDate = models.DateField('Visit Date')
    counter = models.IntegerField()

When someone comes through, it will look for a row that matches the visitType and visitDate; if this row doesn't exist, it will be created with counter=0.

Then we increment the counter and save.
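
A sketch of that flow as I understand it (visit_type and visit_date are placeholders), with the two racy spots marked:

try:
    counter_row = VisitorDayTypeCounter.objects.get(
        visitType=visit_type, visitDate=visit_date)
except VisitorDayTypeCounter.DoesNotExist:
    # Two requests can both reach this branch and both create the row.
    counter_row = VisitorDayTypeCounter(
        visitType=visit_type, visitDate=visit_date, counter=0)
counter_row.counter += 1   # read-modify-write in Python...
counter_row.save()         # ...so a concurrent increment in between is lost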

My concern is that this process is totally a race. Two requests could simultaneously check to see if the entity is there, and both of them could create it. Between reading the counter and saving the result, another request could come through and increment it (resulting in a lost count).

So far I haven't really found a good way around this, either in the Django documentation or in the tutorial (in fact, it looks like the tutorial has a race condition in the Vote part of it).

How do I do this safely?

+4  A: 

Two suggestions:

Add a unique_together to your model, and wrap the creation in an exception handler to catch duplicates:

class VisitorDayTypeCounter(models.Model):
    visitType = models.CharField(max_length=60)
    visitDate = models.DateField('Visit Date')
    counter = models.IntegerField()
    class Meta:
        unique_together = (('visitType', 'visitDate'),)

After this, you could still have a minor race condition on the counter's update. If you get enough traffic to be concerned about that, I would suggest looking into transactions for finer-grained database control. I don't think the ORM has direct support for locking/synchronization. The transaction documentation is available in the Django docs.
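
For the "catch duplicates" part, a minimal sketch assuming the constraint above (visit_type and visit_date are placeholders; Django's get_or_create wraps essentially the same logic for you):

from django.db import IntegrityError

try:
    row = VisitorDayTypeCounter.objects.create(
        visitType=visit_type, visitDate=visit_date, counter=0)
except IntegrityError:
    # A concurrent request inserted the row first; the unique_together
    # constraint rejected ours, so fall back to fetching the existing one.
    row = VisitorDayTypeCounter.objects.get(
        visitType=visit_type, visitDate=visit_date)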

Daniel
The unique_together certainly makes me feel a bit more comfortable. Likely, there won't ever be enough traffic on this to cause the race to get hit, but since I'm learning Django at the same time, I figured I wanted to "do it right". Thanks for the help!
Tony Arkles
Yeah, I hear you. Maybe someone else around here will be aware of an ORM feature for handling this, or can clear up if some of the built-ins are safe against this scenario.
Daniel
A: 

Why not use the database as the concurrency layer? Add a primary key or unique constraint to the table on visitType and visitDate. If I'm not mistaken, Django doesn't exactly support this in its database Model class, or at least I've not seen an example.

Once you've added the constraint/key to the table, then all you have to do is:

  1. Check if the row is there. If it is, fetch it.
  2. Insert the row. If there's no error, you're fine and can move on.
  3. If there's an error (i.e. a race condition), re-fetch the row. If there's still no row, then it's a genuine error. Otherwise, you're fine.

It's nasty to do it this way, but it seems quick enough and would cover most situations.
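
Those three steps are roughly what Django's built-in get_or_create does for you (again with placeholder values), provided the unique constraint from the earlier answer is in place:

row, created = VisitorDayTypeCounter.objects.get_or_create(
    visitType=visit_type, visitDate=visit_date,
    defaults={'counter': 0})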

Roopinder
It doesn't handle the case where two people go to update the counter at the same time.
Tony Arkles
A: 

You should use database transactions to avoid this kind of race condition. A transaction lets you perform the whole operation of creating, reading, incrementing and saving the counter on an "all or nothing" basis. If anything goes wrong it will roll back the whole thing and you can try again.

Check out the Django docs. There is a transaction middleware, or you can use decorators around views or methods to create transactions.
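
A minimal sketch of the decorator style (record_visit is a hypothetical helper; note the caveat in the comments and in the replies below):

import datetime

from django.db import transaction

@transaction.commit_on_success
def record_visit(visit_type):
    # Everything in here runs in a single transaction and is rolled back
    # on an exception. Caveat (see the replies below): at the default
    # isolation level this does not lock the row, so the read-increment-save
    # can still lose a concurrent update.
    row, created = VisitorDayTypeCounter.objects.get_or_create(
        visitType=visit_type, visitDate=datetime.date.today(),
        defaults={'counter': 0})
    row.counter += 1
    row.save()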

Ber
I agree that transactions seem like the answer here, but it's not clear that the functionality will actually solve the increment problem -- the SELECT to get the row will still succeed, and the UPDATE to change the value of the counter will still succeed. If I'm wrong, an example would be awesome.
Tony Arkles
You would need to lock the table during the select to do it this way, and as mentioned by Sam that would drag down your performance. This is the best way if you don't increment the counter often.
Tom Leys
+9  A: 

If you truly want the counter to be accurate you could use a transaction, but the amount of concurrency control required will really drag your application and database down under any significant load.

Instead, think of going with a more messaging-style approach: just keep dumping count records into a table for each visit where you'd want to increment the counter. Then, when you want the total number of visits, do a count on the visits table.

You could also have a background process that runs any number of times a day, sums up the visits, and stores the result in the parent table. To save space it would also delete any records from the child visits table that it has summed up. You'll cut your concurrency costs by a huge amount if you don't have multiple agents vying for the same resource (the counter).
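
A rough sketch of this approach (the Visit model and helper are illustrative, not from the post):

from django.db import models

class Visit(models.Model):
    # One row per visit: plain INSERTs, so concurrent requests never
    # fight over a shared counter row.
    visitType = models.CharField(max_length=60)
    visitDate = models.DateField('Visit Date')

def visit_count(visit_type, visit_date):
    # Aggregate on demand; a periodic job could instead roll these rows up
    # into VisitorDayTypeCounter and delete the rows it has summed.
    return Visit.objects.filter(
        visitType=visit_type, visitDate=visit_date).count()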

Sam Corder
Hey good call! I've been doing App Engine work primarily, and I got hung up on "transactions only act on one entity" and "doing aggregate functions is very expensive". That's a really simple way to solve the problem. Thanks!
Tony Arkles
I suppose it does depend on whether the process is going to be read-heavy or write-heavy. Counts are going to be read much more often than they'll be incremented in my system, so for the problem as stated, this might not be the best plan. However, it does solve other concerns I had, so thanks!
Tony Arkles
Depending on how stale the counts can be allowed to be, you could have a background process summing them up every so often. Then you would not be doing the aggregation per request.
Sam Corder
Interesting to know that this journaling method is similar to how transactions are made safe, atomic and recoverable under the hood in many databases (see the implementation details in the PostgreSQL manual, for instance).
Tom Leys
+2  A: 

This is a bit of a hack. The raw SQL will make your code less portable, but it'll get rid of the race condition on the counter increment. In theory, this should increment the counter any time you do a query. I haven't tested this, so you should make sure the list gets interpolated in the query properly.

from django.db import connection, models

class VisitorDayTypeCounterManager(models.Manager):
    def get_query_set(self):
        qs = super(VisitorDayTypeCounterManager, self).get_query_set()

        pk_list = list(qs.values_list('id', flat=True))
        if pk_list:
            # A single UPDATE does the read and the increment, so the
            # database serializes concurrent increments for us.
            placeholders = ','.join(['%s'] * len(pk_list))
            cursor = connection.cursor()
            cursor.execute(
                'UPDATE %s SET counter = counter + 1 WHERE id IN (%s)'
                % (self.model._meta.db_table, placeholders),
                pk_list)

        return qs

class VisitorDayTypeCounter(models.Model):
    ...

    objects = VisitorDayTypeCounterManager()
iconoplast
Isn't it still possible that the database will execute this query over two separate connections concurrently and still have a (much lower probability) race condition? It all depends on hidden transactions around this query, implemented by the connection layer, that make the operation atomic.
Tom Leys
If you watch the "Why I Hate Django" keynote from DjangoCon, this type of query is given as a proper, race condition-free way to do an increment in SQL (the hitch being that Django's ORM can't do it for you).
iconoplast
I'll watch your slides... you've pretty much confirmed my suspicion that the ORM wouldn't do it on its own. Thanks for the help!
Tony Arkles
If you look at Django Ticket 7210 – "Added expression support for QuerySet.update" – it looks like help is on the way to do this in a portable way using the ORM...
Ber
+1  A: 

You can use the patch from http://code.djangoproject.com/ticket/2705 to get support for database-level locking.

With the patch, this code will be atomic:

visitors = VisitorDayTypeCounter.objects.filter(visitDate=curday).for_update().get()
visitors.counter += 1
visitors.save()
+3  A: 

As of Django 1.1 you can use the ORM's F() expressions. For more details see the documentation:

http://docs.djangoproject.com/en/1.1/ref/models/instances/#updating-attributes-based-on-existing-fields
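
A minimal sketch of the F()-based increment (the filter values are placeholders); because the increment happens in a single SQL UPDATE, there is no read-modify-write gap to race on:

from django.db.models import F

VisitorDayTypeCounter.objects.filter(
    visitType=visit_type, visitDate=visit_date,
).update(counter=F('counter') + 1)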

bjunix
Cool! That's an awesome approach. If only that had been there when I was working on this project.
Tony Arkles