views:

232

answers:

3

Iterating over a queryset, like so:

class Book(models.Model):
    # <snip some other stuff>
    activity = models.PositiveIntegerField(default=0)
    views = models.PositiveIntegerField(default=0)

    def calculate_statistics():
        self.activity = book.views * 4
        book.save()

def cron_job_calculate_all_book_statistics():
    for book in Book.objects.all():
        book.calculate_statistics()

...works just fine. However, this is a cron task. book.views is being incremented while this is happening. If book.views is modified while this cronjob is running, it gets reverted.

Now, book.views is not being modified by the cronjob, but it is being cached during the .all() queryset call. When book.save(), I have a feeling it is using the old book.views value.

Is there a way to make sure that only the activity field is updated? Alternatively, let's say there are 100,000 books. This will take quite a while to run. But the book.views will be from when the queryset originally starts running. Is the solution to just use an .iterator()?

UPDATE: Here's effectively what I am doing. If you have ideas about how to make this work well inline, then I'm all for it.

def calculate_statistics(self):
    self.activity = self.views + self.hearts.count() * 2
    # Can't do self.comments.count with a comments GenericRelation, because Comment uses
    # a TextField for object_pk, and that breaks the whole system. Lame.
    self.activity += Comment.objects.for_model(self).count() * 4
    self.save()
+2  A: 

The following will do the job for you in Django 1.1, no loop necessary:

from django.db.models import F
Book.objects.all().update(activity=F('views')*4)

You can have a more complicated calculation too:

for book in Book.objects.all().iterator():
    Book.objects.filter(pk=book.pk).update(activity=book.calculate_activity())

Both these options have the potential to leave the activity field out of sync with the rest, but I assume you're ok with that, given that you're calculating it in a cron job.

Will Hardy
There is actually a lot of processing going on, including other fields. I just simplified it for the sake of this question. That would be a fine solution, but I've got a lot going on there that I don't think I can just inline.
Samuel Clay
I think you may have to live with the fact that 'activity' could be out of sync with other fields, which I assume you're already fine with, given that you're using a cron to update 'activity'. I've updated the answer to handle more complicated processing, while avoiding overwriting the object.
Will Hardy
Added `.iterator()`, as mentioned by Alex.
Will Hardy
+1  A: 

No matter how you solve this, beware of transaction-related issues. E.g. default transaction isolation level is set to REPEATABLE READ, at least for MySQL backend. This, plus the fact that both Django and db backend work in a specific autocommit mode with an ongoing transaction means, that even if you use (very nice) whrde suggestion, value of `views' could be no longer valid. I could be wrong here, but feel warned.

Tomasz Zielinski
+1  A: 

In addition to what others have said if you are iterating over a large queryset you should use iterator():

Book.objects.filter(stuff).order_by(stuff).iterator()

this will cause Django to not cache the items as it iterates (which could use a ton of memory for a large result set).

Alex Gaynor
.itertor() works great, but is there any way to "lock" the database so I can perform a read and then a subsequent write in an atomic fashion? Will this require dropping down to raw SQL, where it certainly is possible, or does Django provide a hook for this?
Samuel Clay