views:

513

answers:

4

I am looking to setup a automated screen scraper that will run on Google app engine using python. I want it to scrape the site and put the specified results into a Entity in app engine. I am looking for some directions on what to use. I have seen beautifulsoup but wonder if people could recommend anything else that could run on Google App engine.

A: 

The other choice is lxml, but it uses C code and so does not work on GAE.

Ignacio Vazquez-Abrams
OK I really need it to run on GAE
cozza
+1  A: 

Beautifulsoup runs fine on App Engine (just make sure to use 3.0.8, not the iffy 3.1.0). The main alternative, I think, would be html5lib -- I haven't tries it on App Engine but I believe it does run there (quite slowly -- if that's a problem I think you need to stick with BeautifulSoup), e.g. this service runs on App Engine and is based on html5lib.

Alex Martelli
Ahh ok Beautifulsoup it is have you seen any good code examples of Beautifulsoup setup and posting to an entity on app engine
cozza
@cozza, it all depends on what info you want to scrape and store -- I suggest opening another question in which you specify that.
Alex Martelli
@AlexMartelli I took your advice and reposted here : http://stackoverflow.com/questions/2406428/i-want-to-scrape-a-site-using-gae-and-post-the-results-into-a-google-entityThank you for the tip
cozza
A: 

I have used BeautifulSoup with great success parsing HTML. Problem is that's all BeautifulSoup does, is parse the HTML. I ended up writing all the http interactions using urlfetch.

To web-scrape my target I need a full fledged code driven browser that can execute javascript on my target site's pages. I think I'm having to dump the python app and go java so I can use HTMLUnit - prototyping underway. - mattb

Matt Brown
A: 

I have had good (although slow) results using mechanize and BeautifulSoup. In fact, to save code space on Google App Engine, I use the (old) version of BeautifulSoup included in mechanize.

I have mechanize in a zip file, mechanize.zip. The index of this zip file looks like:

mechanize/
mechanize/__init__.py
mechanize/_auth.py
mechanize/_beautifulsoup.py
mechanize/_clientcookie.py
... etc

Then in my Python code,

import sys
sys.path.insert(0, 'mechanize.zip')

import mechanize
from mechanize._beautifulsoup import BeautifulSoup
pix