Go into each of the links and extract various pieces of information (e.g. permissions, prims, etc.), then post the results into an Entity on Google App Engine.
What is the best way to go about this?
Chris
There are several nice screen scraping libraries you can use in Python.
Perhaps the easiest way to knock up an advanced scraper is with Scrapy. It relies on Twisted for its main engine but provides a very easy-to-use interface for implementing custom scraping code.
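As a rough sketch, a minimal Scrapy spider might look like the following; the start URL, the CSS selectors, and the field names (permissions, prims) are placeholders for whatever your pages actually contain:

```python
# Minimal Scrapy spider sketch; URLs, selectors, and field names are assumptions.
import scrapy

class LinkSpider(scrapy.Spider):
    name = "link_spider"
    start_urls = ["http://example.com/links"]  # hypothetical listing page

    def parse(self, response):
        # Follow each link on the listing page to its detail page.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_detail)

    def parse_detail(self, response):
        # The selectors below are guesses; adjust them to the real markup.
        yield {
            "permissions": response.css(".permissions::text").get(),
            "prims": response.css(".prims::text").get(),
        }
```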
Otherwise, you can do it more manually with something like BeautifulSoup, or with Mechanize, which provides a "mechanical" browser implementation.
BeautifulSoup and Mechanize should both work out of the box on App Engine, which provides a wrapper around httplib and urllib that uses urlfetch as a backend. Only Scrapy will be problematic, due to its use of Twisted. [Thanks to Nick Johnson for the update.]
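For example, fetching a page with urlfetch and parsing it with BeautifulSoup on App Engine could look roughly like this; the URL and the tag/class names are assumptions about your pages:

```python
# Sketch: fetch a page on App Engine and pull one value out with BeautifulSoup.
from google.appengine.api import urlfetch
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import (Python 2 era)

def scrape_permissions(url):
    result = urlfetch.fetch(url)
    if result.status_code != 200:
        return None
    soup = BeautifulSoup(result.content)
    # Assumed markup: a <td class="permissions"> holding the value.
    cell = soup.find("td", {"class": "permissions"})
    return cell.string if cell else None
```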
For normalizing HTML with a pure Python library, I have had better experiences with html5lib than with BeautifulSoup.
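A quick html5lib sketch (the input string is just an illustration):

```python
# html5lib parses messy markup the way a browser would and returns an
# ElementTree document by default.
import html5lib

messy = "<p>unclosed paragraph <b>bold"
doc = html5lib.parse(messy)
# By default elements are placed in the XHTML namespace; pass
# namespaceHTMLElements=False if you want plain tag names.
doc_plain = html5lib.parse(messy, namespaceHTMLElements=False)
```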
However, if you just want to extract simply structured information, that doesn't actually require normalizing the HTML. I have a few scraping apps on Google App Engine that use my own XPath library, which works on raw HTML. Or you can use regular expressions for one-off jobs, as sketched below.
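Here is a rough sketch of the regex approach on App Engine, with the result saved as a datastore Entity; the regexes, the model, and the field names (LinkInfo, permissions, prims) are all placeholders for your own data:

```python
# Sketch: one-off regex extraction, stored via the legacy App Engine db API.
import re
from google.appengine.ext import db

class LinkInfo(db.Model):
    url = db.StringProperty()
    permissions = db.StringProperty()
    prims = db.IntegerProperty()

def extract_and_store(url, html):
    # Assumes page text like "Permissions: copy/mod" and "Prims: 42".
    perms = re.search(r"Permissions:\s*([\w/]+)", html)
    prims = re.search(r"Prims:\s*(\d+)", html)
    LinkInfo(
        url=url,
        permissions=perms.group(1) if perms else None,
        prims=int(prims.group(1)) if prims else None,
    ).put()
```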
For scraping data like this, try Automation Anywhere. Anytime, man... anytime! :)