I have a Django view, which receives part of its data from an external website, which I parse using urllib2/BeautifulSoup.

This operation is rather expensive, so I cache the result for about 5 minutes using the low-level cache API. However, each user who accesses the site after the cached data expires gets a delay of a few seconds while I fetch and parse the new data from the external site.

Is there any way to load the new data lazily so that no user will ever get that kind of delay? Or is this unavoidable?

Please note that I am on a shared hosting server, so keep that in mind with your answers.

EDIT: Thanks for the help so far. However, I'm still unsure how to accomplish this with the Python script I will be calling. A basic test shows that the Django cache is not global: if I access it from an external script, the script does not see the data cached by the framework. Suggestions?

Another EDIT: Come to think of it, this is probably because I am still using the local-memory cache. I suspect that moving the cache to memcached, the database, or another shared backend will solve this.
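
A minimal sketch of what such a switch might look like in settings.py, assuming a Django version that uses the CACHES dictionary setting. The directory path is a hypothetical placeholder. Unlike the local-memory backend, a file-based (or database) backend is shared by every process that loads the same settings, including a script run from cron.

```python
# settings.py (fragment): a sketch only; the LOCATION path is a hypothetical placeholder.
CACHES = {
    "default": {
        # File-based backend: entries live on disk, so the web process
        # and an external script both see the same cached data.
        "BACKEND": "django.core.cache.backends.filebased.FileBasedCache",
        "LOCATION": "/path/to/project/cache",
    }
}
```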

+6  A: 

So you want to schedule something to run at a regular interval? At the cost of some CPU time, you can use this simple app.

Alternatively, if you can use it, the cron job for every 5 minutes is:

*/5 * * * * /path/to/project/refresh_cache.py

Web hosts provide different ways of setting these up. For cPanel, use the Cron Manager. For Google App Engine, use cron.yaml. For all of these, you'll need to set up the environment in refresh_cache.py first.

By the way, responding to a user's request is considered lazy caching. This is pre-emptive caching. And don't forget to cache long enough for the page to be recreated!
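
As a rough sketch of what refresh_cache.py could contain, assuming Django 1.7 or later (for django.setup()) and a cache backend shared between processes: the settings module name, the cache key, and the fetch_and_parse function are all hypothetical stand-ins for your own project. Note that the timeout is longer than the cron interval, so the old entry survives while the new one is being built.

```python
#!/usr/bin/env python
"""refresh_cache.py: run from cron to refill the cache before it expires."""
import os

REFRESH_INTERVAL = 5 * 60              # cron runs every 5 minutes
CACHE_TIMEOUT = 2 * REFRESH_INTERVAL   # outlive one slow or missed run
CACHE_KEY = "external_site_data"       # hypothetical key name


def refresh(cache, fetch_and_parse, key=CACHE_KEY, timeout=CACHE_TIMEOUT):
    """Fetch fresh data and store it; keep the old entry if fetching fails."""
    try:
        data = fetch_and_parse()
    except Exception:
        return False  # the previous value stays in the cache
    cache.set(key, data, timeout)
    return True


def main():
    # Set up the Django environment before importing anything that needs it.
    # "myproject.settings" is a hypothetical module name.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
    import django
    django.setup()
    from django.core.cache import cache
    # Hypothetical import: wherever the existing urllib2/BeautifulSoup code lives.
    from myapp.scraper import fetch_and_parse
    refresh(cache, fetch_and_parse)
```

To run it from cron, add the usual `if __name__ == "__main__": main()` guard at the bottom of the script. The view then only ever calls cache.get() with the same key and never fetches the external site itself.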

Mark
+2  A: 

"I'm still unsure as to how I accomplish this with the python script I will be calling. "

The issue is that your "significant delay of a few seconds while I go to the external site to parse the new data" has nothing to do with Django cache at all.

You can cache it anywhere you like, but whenever you re-parse the external site, there's a delay. The trick is to NOT parse the external site while a user is waiting for their page.

Instead, parse the external site before a user asks for a page. Since you can't go back in time, you have to periodically parse the external site and leave the parsed results in a local file or a database or something.

When a user makes a request you already have the results fetched and parsed, and all you're doing is presenting.
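
A minimal sketch of the "local file" variant, assuming Python 3 and that the parsed results can be serialized as JSON. The file name and both function names are made up for illustration. The periodic job calls save_results(); the view only calls load_results().

```python
import json
import os
import tempfile

RESULTS_PATH = "parsed_results.json"  # hypothetical location


def save_results(data, path=RESULTS_PATH):
    """Periodic job: write to a temp file, then rename, so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(data, tmp)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


def load_results(path=RESULTS_PATH, default=None):
    """Request time: read whatever the job left behind, with no fetching or parsing."""
    try:
        with open(path) as f:
            return json.load(f)
    except (IOError, ValueError):
        return default  # file missing or unreadable
```

Writing to a temporary file and renaming it means a request that arrives mid-refresh still reads the complete previous version.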

S.Lott
A: 

You can also use a Python script to call your view and write the output to a file, then serve it statically with lighttpd, for example:

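from django.conf import settings
from django.http import HttpRequest
from django.urls import resolve  # django.core.urlresolvers on older Django versions

request = HttpRequest()
request.path = url  # the URL of your view
# resolve() yields the view function, its positional args and its keyword args
detail_func, args, params = resolve(url)
params['gmap_key'] = settings.GMAP_KEY_STATIC
detail = detail_func(request, **params)
# binary mode, since the response content is a byte string
with open(dir + "index.html", 'wb') as out:
    out.write(detail.content)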

Then call your script from a cron job.

fredz
+2  A: 

I have no proof, but I've read that BeautifulSoup is slow and consumes a lot of memory. You may want to look at the lxml module instead; it is supposed to be much faster and more efficient, and can do much more than BeautifulSoup.

Of course, the parsing probably isn't your bottleneck here; the external I/O is.

First off, use memcached!

Then, one strategy that can be used is as follows:

  • Your cached object, called A, is stored in the cache with a dynamic key (A_<timestamp>, for example).
  • Another cached object holds the current key for A, called A_key.
  • Your app would then get the key for A by first getting the value at A_key
  • A periodic process would populate the cache with the A_<timestamp> keys and upon completion, change the value at A_key to the new key

With this method, users never have to wait for the cache to be rebuilt every 5 minutes; they just get the older version until the update completes.
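
The strategy above can be sketched roughly as follows, assuming a cache object with Django-style get(key) and set(key, value, timeout) methods. The function names and the 15-minute timeout are illustrative choices, not part of any library.

```python
import time

POINTER_KEY = "A_key"       # holds the name of the current versioned key
DEFAULT_TIMEOUT = 15 * 60   # assumption: old versions expire on their own


def publish(cache, data, timeout=DEFAULT_TIMEOUT, now=None):
    """Periodic process: store data under a fresh key, then repoint A_key."""
    stamp = int(time.time() if now is None else now)
    versioned_key = "A_%d" % stamp
    cache.set(versioned_key, data, timeout)         # 1. write the new version
    cache.set(POINTER_KEY, versioned_key, timeout)  # 2. only then switch readers over
    return versioned_key


def current(cache, default=None):
    """Request time: follow the pointer to the latest complete version."""
    versioned_key = cache.get(POINTER_KEY)
    if versioned_key is None:
        return default
    value = cache.get(versioned_key)
    return default if value is None else value
```

Because the pointer is updated only after the new version is fully stored, a request arriving during a refresh still resolves A_key to the previous, complete entry.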

Grant