Many of my views fetch external resources. I want to make sure that under heavy load I don't blow up the remote sites (and/or get banned).

I only have 1 crawler so having a central lock will work fine.

So the details: I want to allow at most 3 queries to a host per second, and have the rest block for a maximum of 15 seconds. How could I do this (easily)?

Some thoughts :

  • Use django cache
    • Seems to only have 1 second resolution
  • Use a file based semaphore
    • Easy to do locks for concurrency. Not sure how to make sure only 3 fetches happen a second.
  • Use some shared memory state
    • I'd rather not install more things, but will if I have to.
+1  A: 

What about using a different process to handle scraping, and a queue for the communication between it and Django?
This way you would be able to easily change the number of concurrent requests, and it would also automatically keep track of the requests, without blocking the caller.
Most of all, I think it would help lowering the complexity of the main application (in Django).
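
For illustration, the worker side might look roughly like this (just a sketch: pop_job() is a stand-in for whatever queue transport you pick, and the 3-per-second throttle is kept in memory since there's only one crawler):

import time
from collections import defaultdict
from urllib.parse import urlparse
from urllib.request import urlopen

MIN_INTERVAL = 1.0 / 3             # at most 3 requests per host per second

def worker_loop(pop_job):
    # pop_job() should block until Django has queued another URL, then return it.
    last_fetch = defaultdict(float)    # host -> time of the last request to it
    while True:
        url = pop_job()
        host = urlparse(url).netloc
        wait = last_fetch[host] + MIN_INTERVAL - time.time()
        if wait > 0:
            time.sleep(wait)           # throttle this host
        last_fetch[host] = time.time()
        urlopen(url)                   # hand the result back however you like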

Roberto Liffredo
Doing so would require me to maintain another long-running process outside of Apache. Slightly more sysadmin complexity... hmm....
Paul Tarjan
A: 

One approach; create a table like this:

class Queries(models.Model):
    site = models.CharField(max_length=200, db_index=True)
    start_time = models.DateTimeField(null=True)
    finished = models.BooleanField(default=False)

This records when each query has taken place, or will take place in the future if rate limiting prevents it from happening immediately. start_time is the time the action is to start; it's in the future if the action is currently blocking.

Instead of thinking in terms of queries per second, let's think in terms of seconds per query; in this case, 1/3 second per query.

Whenever an action is to be performed, do the following:

  • Create a row for the action. q = Queries.objects.create(site=sitename)
  • On the object you just created (q.id), atomically set start_time to the greatest start_time for this site plus 1/3 second. If the greatest is 10 seconds in the future, then we can start our action at 10 1/3 seconds. If that time is in the past, clamp it to now().
  • If the start_time that was just set is in the future, sleep until that time. If it's too far in the future (e.g. more than 15 seconds), delete the row and error out.
  • When the query is finished, set finished to True, so the row can be purged later on.

The atomic action is what's important. You can't simply do an aggregate on Queries and then save it, since it'll race. I don't know if Django can do this natively, but it's easy enough in raw SQL:

UPDATE site_queries
SET start_time = GREATEST(now(), COALESCE(
    (SELECT MAX(start_time) FROM site_queries WHERE site = site_name)
        + interval '1 second' / 3,
    now()))
WHERE id = object_id

Then, reload the model and sleep if necessary. You'll also need to purge old rows. Something like Queries.objects.filter(site=site, finished=True).exclude(id=id).delete() will probably work: delete all finished queries except the one you just made. (That way, you never delete the latest query, since later queries need that to be scheduled.)
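
Put together, the calling code might look roughly like this (an untested sketch: the site_queries table name, the import path, and the error handling are assumptions, and it glosses over the autocommit caveat below):

import time
from datetime import datetime
from django.db import connection
from myapp.models import Queries   # wherever the model above lives (assumed path)

MAX_WAIT = 15   # give up rather than block longer than this

def schedule_query(site):
    # Create the row, atomically claim the next free 1/3-second slot for this
    # site, then sleep until that slot arrives.
    q = Queries.objects.create(site=site)
    cursor = connection.cursor()
    cursor.execute("""
        UPDATE site_queries
        SET start_time = GREATEST(now(), COALESCE(
            (SELECT MAX(start_time) FROM site_queries WHERE site = %s)
                + interval '1 second' / 3,
            now()))
        WHERE id = %s
    """, [site, q.id])
    q = Queries.objects.get(id=q.id)    # reload to see the assigned slot
    # Assumes the database and Python clocks (and timezones) agree.
    delay = (q.start_time - datetime.now()).total_seconds()
    if delay > MAX_WAIT:
        q.delete()
        raise RuntimeError("rate limited: would have to wait %.1f seconds" % delay)
    if delay > 0:
        time.sleep(delay)
    return q

def finish_query(q):
    # Mark the query done and purge older finished rows for this site.
    q.finished = True
    q.save()
    Queries.objects.filter(site=q.site, finished=True).exclude(id=q.id).delete()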

Finally, make sure the UPDATE doesn't take place inside a transaction; autocommit must be turned on for this to work. Otherwise the UPDATE won't be atomic: two requests could run the UPDATE at the same time and receive the same result. Django and Python's DB-API typically have autocommit off, so you need to turn it on for the UPDATE and then back off afterwards. With Postgres this means switching to ISOLATION_LEVEL_AUTOCOMMIT via connection.set_isolation_level() before the query and back to ISOLATION_LEVEL_READ_COMMITTED afterwards. I don't know how to do this with MySQL.
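
With psycopg2, that switch might look roughly like this (a sketch only: how you reach the raw psycopg2 connection, and whether you need to at all, varies by Django version):

from django.db import connection
from psycopg2.extensions import (ISOLATION_LEVEL_AUTOCOMMIT,
                                 ISOLATION_LEVEL_READ_COMMITTED)

def run_autocommit(sql, params):
    # Run one statement in autocommit mode, then restore the usual level.
    cursor = connection.cursor()           # also ensures the connection exists
    pg_conn = connection.connection        # the underlying psycopg2 connection
    pg_conn.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
    try:
        cursor.execute(sql, params)        # committed immediately
    finally:
        pg_conn.set_isolation_level(ISOLATION_LEVEL_READ_COMMITTED)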

(I consider the default of having autocommit turned off in Python's DB-API to be a serious design flaw.)

The benefit of this approach is that it's quite simple, with straightforward state; you don't need things like event listeners and wakeups, which have their own sets of problems.

A possible issue is that if the user cancels the request during the delay, the delay is still enforced whether or not you ever perform the action. If you never start the action, other requests won't move down into the unused "timeslot".

If you're not able to get autocommit to work, a workaround would be to add a UNIQUE constraint to (site, start_time). (I don't think Django understands that directly, so you'd need to add the constraint yourself.) Then, if the race happens and two requests to the same site end up at the same time, one of them will throw a constraint exception that you can catch, and you can just retry. You could also use a normal Django aggregate instead of raw SQL. Catching constraint exceptions isn't as robust, though.
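
A rough sketch of that fallback (the import path, the slot width, and the retry loop are all illustrative; under Postgres you may also need to roll back the aborted transaction before retrying):

from datetime import datetime, timedelta
from django.db import IntegrityError
from django.db.models import Max
from myapp.models import Queries   # wherever the model above lives (assumed path)

SLOT = timedelta(seconds=1.0 / 3)

def claim_slot(q):
    # Assign q.start_time to the next free slot for its site; if two requests
    # race and pick the same slot, the UNIQUE constraint rejects one of them
    # and we simply recompute and try again.
    while True:
        latest = (Queries.objects.filter(site=q.site)
                                 .exclude(id=q.id)
                                 .aggregate(Max('start_time'))['start_time__max'])
        now = datetime.now()
        q.start_time = max(now, latest + SLOT) if latest else now
        try:
            q.save()
            return q.start_time
        except IntegrityError:
            continue    # lost the race; recompute and retry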

Glenn Maynard
Another benefit of this, by the way, is that it keeps the state inside the database, which means the limiting works even if you have multiple servers all making requests separately, and other similar cases.
Glenn Maynard
Worked great. I skipped the atomic update, since the cost of a race condition (hitting the site with one extra request) isn't worth the added program complexity.
Paul Tarjan
Good point. Sometimes the habitual aim for correctness misses what matters in practice.
Glenn Maynard
Another improvement: your method did 9 queries per fetch. I only do the cleanup 1% of the time (random number), and I removed the `finished` attribute in favor of `models.Queries.objects.filter(site=site, start_time__lte=datetime.now()).delete()`. So it's 1 query per fetch, and 2 queries 1% of the time.
Paul Tarjan
I didn't do that, because if the cleanup happened between another request creating its start_time and reading the value back from the database, it would delete the row before the other request could read it. (If you're looking up the time separately before inserting, that particular issue goes away.) Since this is bookkeeping for remote requests, I didn't bother trying to optimize it.
Glenn Maynard
By the way, how do you have only 1 query? Computing start_time takes one, and creating the row takes one. Those could be combined (put the MAX calculation in a raw INSERT), but then you'd need another query to read back the value that was inserted.
Glenn Maynard