Suppose I have a simple view which needs to parse data from an external website.

Right now it looks something like this:

import urllib2
import BeautifulSoup
from django.shortcuts import render_to_response

def index(request):
    source = urllib2.urlopen(EXTERNAL_WEBSITE_URL)
    bs = BeautifulSoup.BeautifulSoup(source.read())
    finalList = [] # do whatever with bs to populate the list
    return render_to_response('someTemplate.html', {'finalList': finalList})

First of all, is this an acceptable use?

Obviously, this is not good performance-wise. The external website page is pretty big, and I am only extracting a small part of it. I thought of two solutions:

  1. Do all of this asynchronously: load the rest of the page, then populate it with the data once I get it. But I don't even know where to start. I'm just starting with Django and have never done anything async up until now.
  2. I don't care if this data is updated every 2-3 minutes, so caching is a good solution as well (also saves me the extra round-trips). How would I go about caching this data?
+4  A: 

First, don't optimize prematurely. Get this to work.

Then, add enough logging to see what the performance problems (if any) really are.
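As a rough sketch of the kind of logging meant here, a small helper can time each stage (the fetch, the parse) separately. The helper name `timed`, the logger name, and `fake_fetch` are hypothetical, and the example is written for Python 3, with a stand-in function where the real remote read would go.

```python
import logging
import time

logger = logging.getLogger("scrape.timing")  # hypothetical logger name


def timed(label, func, *args, **kwargs):
    """Call func, log how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    logger.info("%s took %.3f s", label, elapsed)
    return result, elapsed


def fake_fetch():
    # Stand-in for the remote read; swap in the real fetch call here.
    time.sleep(0.05)
    return "<html>...</html>"


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    html, seconds = timed("fetch", fake_fetch)
```

Wrapping the fetch and the parse in separate `timed` calls shows which of the two actually dominates the request.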

You may find that the end user's PC is the slowest part; getting data from another site may actually be remarkably fast when you do not fetch the .JS libraries, .CSS and artwork and then render the entire thing in a browser.

Once you're absolutely sure that the fetch of the remote content really IS a problem (really), you have to do the following.

  1. Write a "crontab" script that does the remote fetch from time to time.

  2. Design a place to cache the remote results. Database or file system, pick one.

  3. Update your Django app to get the data from the cache (database or filesystem) instead of the remote URL.

Do this only after you have absolute proof that the urllib2 read of the remote site is the bottleneck.
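The three steps above could be sketched roughly as follows, assuming the file system is picked as the cache and the extracted data is JSON-serializable. The file location and the function names `write_cache` and `read_cache` are hypothetical; the example is written for Python 3.

```python
import json
import os
import tempfile

# Hypothetical cache location; any path both the cron script and the app can reach works.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "external_site_cache.json")


def write_cache(items, path=CACHE_PATH):
    """Steps 1 and 2: called by the cron script after it has fetched and parsed the page."""
    directory = os.path.dirname(path) or "."
    # Write to a temporary file, then rename it, so readers never see a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(items, handle)
    os.replace(tmp_path, path)


def read_cache(path=CACHE_PATH):
    """Step 3: called by the view instead of hitting the remote URL."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except (OSError, ValueError):
        # No cache file yet, or it is unreadable: fall back to an empty list.
        return []
```

The view would then pass `read_cache()` to the template as `finalList`, and a crontab entry running every few minutes would call `write_cache` with the freshly parsed data.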

S.Lott
+1  A: 

Django has robust, built-in support for caching views: http://docs.djangoproject.com/en/dev/topics/cache/#topics-cache.

It offers solutions for caching entire views (such as in your case), or just certain parts of data in the view. There are even controls for how often to update the cache, and so forth.

Jarret Hardie
+2  A: 

Caching with Django is pretty easy:

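from django.core.cache import cache
from django.shortcuts import render_to_response

def index(request):
    key = 'some-key'
    data = cache.get(key)
    if data is None:
        data = []  # soupify the page and what not to populate the list
        cache.set(key, data, 60*60*8)  # cache.set(key, value, timeout in seconds)
    return render_to_response('someTemplate.html', {'finalList': data})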

To answer your questions, you can do this asynchronously, but then you would have to use something like django cron to update the cache every so often. On the other hand, you can write this as a standalone Python script, replace the cache imported from Django with memcache, and it would work the same way. It would reduce some of the performance issues your site could have, and as long as you know the cache key, you can retrieve the data from the cache.
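To illustrate the standalone variant, here is a minimal sketch of the same get-or-build pattern outside Django. `TinyCache` is a hypothetical in-memory stand-in that only mimics the `get(key)` / `set(key, value, timeout)` shape; it is not a memcache client, and a real script would need a cache shared between processes for the site to see the data.

```python
import time


class TinyCache(object):
    """In-memory stand-in exposing get(key) and set(key, value, timeout)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:
            del self._store[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key, value, timeout):
        # timeout is in seconds, counted from now
        self._store[key] = (value, time.time() + timeout)


def get_or_build(cache, key, build, timeout):
    """Return the cached value for key, calling build() only on a miss."""
    data = cache.get(key)
    if data is None:
        data = build()
        cache.set(key, data, timeout)
    return data
```

For data that may be two or three minutes stale, a call such as `get_or_build(cache, 'some-key', build, 180)` would rebuild the list at most once every three minutes, where `build` is whatever function fetches and parses the page.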

Like Jarret said, I would read Django's caching docs and memcache's docs for more information.

notzach
soupify? I love it! :-)
Jarret Hardie