I'm pulling some RSS feeds into a datastore in App Engine to serve up to an iPhone app. I use cron to schedule updating the RSS every x minutes. Each task only parses one RSS feed (which has 15-20 items). I frequently get warnings about high CPU usage in the App Engine dashboard, so I'm looking for ways to optimise my code.

Currently, I use minidom (since it's already there on App Engine), but I suspect it's not very efficient!

Here's the code:

    from datetime import datetime
    from xml.dom import minidom

    from google.appengine.api import urlfetch
    from google.appengine.ext import db

    # Fetch the feed and parse the whole document into a DOM tree.
    dom = minidom.parseString(urlfetch.fetch(url).content)
    if dom:
        items = []
        for node in dom.getElementsByTagName('item'):
            item = RssItem(
                key_name=self.getText(node.getElementsByTagName('guid')[0].childNodes),
                title=self.getText(node.getElementsByTagName('title')[0].childNodes),
                description=self.getText(node.getElementsByTagName('description')[0].childNodes),
                modified=datetime.now(),
                link=self.getText(node.getElementsByTagName('link')[0].childNodes),
                categories=[self.getText(category.childNodes)
                            for category in node.getElementsByTagName('category')]
            )
            items.append(item)
        # Store all items with a single batch put.
        db.put(items)

    def getText(self, nodelist):
        # Concatenate the data of every text node in the list.
        return ''.join(node.data for node in nodelist
                       if node.nodeType == node.TEXT_NODE)

There isn't much going on, but the scripts often take 2-6 seconds of CPU time, which seems a bit excessive for looping through 20ish items and reading a few attributes.

What can I do to make this faster? Is there anything particularly bad in the above code, or should I change to another way of parsing? Are there any libraries (that work on App Engine) that would be better, or would I be better off parsing the RSS myself?

+1  A: 

I'd try ElementTree or the Universal Feed Parser and see if they're any better. ElementTree is in the stdlib as of Python 2.5, so it's available on App Engine.
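As a rough illustration of the ElementTree route, here is a minimal sketch that pulls the same fields as the question's minidom loop. The sample feed and the `parse_items` helper are made up for this example, and it returns plain dicts rather than `RssItem` entities so it can run outside App Engine. Note that `Element.iter` needs Python 2.7 or later; on 2.5 the equivalent call is `getiterator`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, only for illustration.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example feed</title>
<item>
  <guid>a1</guid><title>First</title><link>http://example.com/1</link>
  <description>One</description>
  <category>news</category><category>tech</category>
</item>
<item>
  <guid>a2</guid><title>Second</title><link>http://example.com/2</link>
  <description>Two</description>
</item>
</channel></rss>"""


def parse_items(xml_text):
    """Return one dict per <item>, with the fields used in the question."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter('item'):
        items.append({
            # findtext returns the default when the child element is missing.
            'guid': node.findtext('guid', default=''),
            'title': node.findtext('title', default=''),
            'description': node.findtext('description', default=''),
            'link': node.findtext('link', default=''),
            'categories': [c.text or '' for c in node.findall('category')],
        })
    return items
```

Whether this is actually cheaper than minidom for a given feed is something to measure rather than assume.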

Will McCutchen
I'll check these out. I've used minidom before and found it to be very CPU-intensive, so I'm sure there has to be something much more efficient out there.
Danny Tuppeny
A: 

You probably should run a profiler to pinpoint where the code is spinning its wheels. It could be waiting on the connections as some RSS feeds are REAL slow.
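For the profiling step, a minimal sketch using the standard library's `cProfile` and `pstats` might look like the following. The `parse_titles` function and the sample document are stand-ins invented for this example; in practice you would wrap your own parsing function. Running it locally only shows where the parsing time goes, and says nothing about App Engine's own CPU accounting.

```python
import cProfile
import io
import pstats
from xml.dom import minidom

# Hypothetical stand-in for the feed; a real run would use fetched content.
SAMPLE_RSS = "<rss><channel>" + "".join(
    "<item><title>Item %d</title></item>" % i for i in range(20)
) + "</channel></rss>"


def parse_titles(xml_text):
    """Hypothetical function under test: collect item titles with minidom."""
    dom = minidom.parseString(xml_text)
    return [node.getElementsByTagName('title')[0].firstChild.data
            for node in dom.getElementsByTagName('item')]


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep the ten most expensive entries.
    stats.sort_stats('cumulative').print_stats(10)
    return result, stream.getvalue()
```

Calling `profile_call(parse_titles, SAMPLE_RSS)` returns the parsed titles together with a report you can print or log.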

Also, some RDF/RSS/ATOM libraries build in a governor to keep from beating the cr*p out of the host when retrieving multiple feeds from the same site. I've written several aggregators and being considerate to the server is important.

Universal Feed Parser is full-featured, at least from what I've seen by looking through the docs. I didn't use it because I wrote my aggregators in Ruby and had different needs but I was aware of it and would consider it for a Python-based solution.

Greg
"It could be waiting on the connections". That should not count as CPU time, though.
Thilo
As Thilo says, it shouldn't count as CPU time while waiting for the server response. Do you know of any profilers that will give detailed info? I've used Red Gate ANTS for .NET (which is great), but I haven't seen anything like that for Python (I've used Guido's AppStats, but it only covers API calls).
Danny Tuppeny
+4  A: 

Outsource feed parsing, for example via superfeedr

You could also look into superfeedr.com. They have a reasonable free quota as well as paid plans. They will do the polling for you (you get updates within 15 minutes). If the feeds also support PubSubHubbub, you will receive updates in real time! This video will explain what PubSubHubbub is if you don't know yet.

Improved feed parser written by Brett Slatkin

I would also advise you to watch this awesome video from Brett Slatkin explaining PubSubHubbub. I also remember that somewhere in the presentation he says that he does not use Universal Feed Parser because it just does too much work for his problem. He wrote his own SAX parser (he talks about it a little at 14:10 in the video), which is lightning fast. I guess you should check out the PubSubHubbub code to find out how he accomplished this.
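To give an idea of what a SAX-based approach looks like, here is a minimal sketch using the standard library's `xml.sax`. This is not Brett Slatkin's parser; the `ItemTitleHandler` class, the `item_titles` helper, and the sample feed are all invented for illustration, and the handler only collects item titles. The point is that SAX streams through the document and keeps only what you ask for, instead of building a full tree.

```python
import xml.sax

# Hypothetical sample feed, passed as bytes as it would arrive from a fetch.
SAMPLE_RSS = (b"<rss><channel><title>Feed title</title>"
              b"<item><title>First</title></item>"
              b"<item><title>Second</title></item>"
              b"</channel></rss>")


class ItemTitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> that sits inside an <item>."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.titles = []
        self._in_item = False
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == 'item':
            self._in_item = True
        elif name == 'title' and self._in_item:
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so buffer it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == 'title' and self._in_title:
            self.titles.append(''.join(self._buffer))
            self._in_title = False
        elif name == 'item':
            self._in_item = False


def item_titles(xml_bytes):
    handler = ItemTitleHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.titles
```

The channel-level title is skipped because it is not inside an `item` element.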

Alfred
+1. Let someone else do most of the work for you, then you only need to parse the new entries. It doesn't have to be superfeedr, either: Any hubbub hub that supports polling will do.
Nick Johnson
This is more work than I was hoping for, since I believe 6 seconds to parse 20 items is terrible, and I hoped to fix it with relatively small changes. That said, I need my feeds to be up to date for the 1% of items that are new, so I am doing a lot of duplicate work, and this may be a worthwhile change. I'll get the rest of the app working and consider this when all else is done :-)
Danny Tuppeny
@Nick Johnson I thought that superfeedr was the only public PubSubHubbub hub that supports polling. Do you know of others too?
Alfred
I'm not aware of any - which isn't to say none exist. You can run your own on appspot, even. I just wanted to point out that superfeedr isn't unique. :)
Nick Johnson
I'll check out Brett's video - thanks!
Danny Tuppeny
You're welcome. I really enjoyed watching it.
Alfred
+1  A: 

If you have a low amount of traffic coming to your site, you might be experiencing spin-up times for your app. If an app is idle for as little as a few minutes, App Engine will spin it down to save resources. When the next request comes in, the app has to be spun up before it can handle the request, and this all gets added to your CPU quota. If you search the App Engine newsgroup, you will see that it is full of complaints about this.

I use superfeedr for my site www.newsfacet.com, and I notice that when superfeedr notifies me, most of the time I can handle a few RSS articles in a few hundred milliseconds. If it's been a while since the last input, this time can jump to 10 or 11 seconds as it incurs the spin-up cost.

dunelmtech
I concur that the spin up time is probably your problem in this instance.
Finbarr
This is entirely possible - I'll investigate :-)
Danny Tuppeny
Implement the servlet init method and log a message from there; that should tell you each time a spin-up occurs.
dunelmtech
+1  A: 

With regard to using PubSubHubbub to let someone else do the work for you, you may find my blog post on using hubbub on App Engine useful.

Nick Johnson
+1 for your blog :). You really post good articles.
Alfred