+3  A: 

There are several nice screen scraping libraries you can use in Python.

Perhaps the easiest way to knock together an advanced scraper is Scrapy. It relies on Twisted for its main engine but provides a very easy-to-use interface for implementing custom scraping code.

Otherwise you can do it more manually with something like BeautifulSoup, or with Mechanize, which provides a "mechanical" browser implementation.
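To give a feel for the manual route, here is a minimal sketch that pulls links out of a page. It deliberately uses the standard library's html.parser rather than BeautifulSoup so it runs without installing anything; BeautifulSoup would make the same job shorter. The names LinkCollector, extract_links and SAMPLE are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, standing in for a fetched page.
SAMPLE = '<p><a href="/a">First</a> and <a href="/b">Second <b>bold</b></a></p>'

if __name__ == "__main__":
    print(extract_links(SAMPLE))
```

Anchors without an href are skipped, and nested tags inside a link contribute their text to the link's label.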

BeautifulSoup and Mechanize should both work out of the box on App Engine, which provides a wrapper around httplib and urllib that uses urlfetch as a backend. Only Scrapy will be problematic, due to its use of Twisted. [Thanks to Nick Johnson for the update.]
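For reference, the fetching side needs nothing special: it is ordinary urllib code, which is the kind of call the wrapper described above sits behind. This is a rough sketch written against Python 3's urllib.request (the urllib2 mentioned in the comments is the Python 2 counterpart). The function name fetch_text and the User-Agent string are placeholders of mine, not anything App Engine requires.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body decoded to str."""
    # The User-Agent value is a placeholder; some sites reject the default one.
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    # A data: URL keeps the demo offline; swap in a real http(s) URL to scrape.
    print(fetch_text("data:text/plain,hello%20world"))
```

The returned string can then be handed to whichever parser you prefer.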

jkp
GAE provides the urlfetch module as a way to bypass the socket opening restriction.
gnibbler
@gnibbler: that's good to know. I guess the issue is that it will not work out of the box with any of the frameworks I listed, so it would mean writing something from the ground up. BeautifulSoup could still be used to process the results, though. Thanks for the heads-up, +1.
jkp
You can still use urllib2 on GAE, but it is then wrapped around urlfetch, with some functionality removed.
Plumo
BeautifulSoup and Mechanize should both work out of the box on App Engine - it provides a wrapper around httplib and urllib that uses urlfetch as a backend. Only scrapy will be problematic, due to its use of twisted.
Nick Johnson
@Nick Johnson: really good to know. I may try this myself at some point.
jkp
@jkp Perhaps you could update your reply so it doesn't say you're out of luck in the last paragraph, then?
Nick Johnson
@Nick Johnson: updated, credited you with the find. Thanks again. +1
jkp
+2  A: 

For normalizing HTML with a pure-Python library, I have had better experiences with html5lib than with BeautifulSoup.

However, you just want to extract simply structured information, which doesn't actually require normalizing the HTML. I have a few scraping apps on Google App Engine that use my own XPath library, which works with raw HTML. Or you can use regular expressions for one-off jobs.
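As an illustration of the regular-expression route for a one-off job, here is a small sketch. The markup in SAMPLE, the class names, and the names PRICE_ROW and extract_prices are all invented for the example; a real page would need its own pattern, and the pattern will break as soon as the markup changes.

```python
import re

# Matches one name cell followed by one price cell, tolerating whitespace
# (including newlines) between the two cells.
PRICE_ROW = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td>\s*'
    r'<td class="price">\$(?P<price>\d+(?:\.\d{2})?)</td>'
)


def extract_prices(html):
    """Return a list of (name, price) tuples found in the raw HTML string."""
    return [
        (match.group("name").strip(), float(match.group("price")))
        for match in PRICE_ROW.finditer(html)
    ]


# Hypothetical raw HTML, used as-is without any normalization.
SAMPLE = """
<tr><td class="name">Widget</td> <td class="price">$9.99</td></tr>
<tr><td class="name">Gadget</td>
    <td class="price">$12</td></tr>
"""

if __name__ == "__main__":
    print(extract_prices(SAMPLE))
```

This works directly on the raw text, which is why no HTML cleanup step is needed for simply structured data.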

Plumo
A: 

For scraping data like this, use Automation Anywhere. Anytime, man, anytime! :)

Claren