I am looking to set up an automated screen scraper that will run on Google App Engine using Python. I want it to scrape a site and put the specified results into an Entity in App Engine. I am looking for some direction on what to use. I have seen BeautifulSoup, but I wonder if people could recommend anything else that can run on Google App Engine.
The other choice is lxml, but it uses C code and so does not work on GAE.
BeautifulSoup runs fine on App Engine (just make sure to use 3.0.8, not the iffy 3.1.0). The main alternative, I think, would be html5lib -- I haven't tried it on App Engine, but I believe it does run there (quite slowly; if that's a problem, stick with BeautifulSoup). For example, this service runs on App Engine and is based on html5lib.
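If you do go the html5lib route, a minimal sketch might look like the following (the extract_headings helper and the h1 tags are just placeholders, not anything from the service mentioned above):

import html5lib

def extract_headings(html_text):
    # html5lib is pure Python, so it runs on App Engine (slowly).
    # Parse into a minidom tree and pull out the <h1> text nodes.
    doc = html5lib.parse(html_text, treebuilder="dom")
    return [h1.firstChild.nodeValue
            for h1 in doc.getElementsByTagName("h1")
            if h1.firstChild is not None]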
I have used BeautifulSoup with great success for parsing HTML. The problem is that parsing HTML is all BeautifulSoup does. I ended up writing all the HTTP interactions using urlfetch.
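For reference, a rough sketch of that split (the URL handling and the link extraction are made-up examples, not from my actual app):

from google.appengine.api import urlfetch
from BeautifulSoup import BeautifulSoup

def scrape_links(url):
    # urlfetch does the HTTP work; BeautifulSoup only parses the result.
    result = urlfetch.fetch(url)
    if result.status_code != 200:
        return []
    soup = BeautifulSoup(result.content)
    return [a['href'] for a in soup.findAll('a', href=True)]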
To web-scrape my target I need a full-fledged, code-driven browser that can execute JavaScript on my target site's pages. I think I'm having to dump the Python app and go Java so I can use HTMLUnit -- prototyping is underway. - mattb
I have had good (although slow) results using mechanize and BeautifulSoup. In fact, to save code space on Google App Engine, I use the (old) version of BeautifulSoup included in mechanize.
I have mechanize in a zip file, mechanize.zip. The index of this zip file looks like:
mechanize/
mechanize/__init__.py
mechanize/_auth.py
mechanize/_beautifulsoup.py
mechanize/_clientcookie.py
... etc
Then in my Python code,
import sys
# Put the zip on sys.path so zipimport can load the packaged mechanize.
sys.path.insert(0, 'mechanize.zip')
import mechanize
from mechanize._beautifulsoup import BeautifulSoup
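To round it out, here is a hypothetical end-to-end sketch: fetch a page with the zipimported mechanize, parse it, and save one field into a datastore Entity. The ScrapedPage model is made up, and find('title') assumes the standalone BeautifulSoup 3 API -- the copy bundled with mechanize is older, so adjust that call to its method names if necessary.

from google.appengine.ext import db

class ScrapedPage(db.Model):
    url = db.StringProperty()
    title = db.StringProperty()

def scrape_and_store(url):
    br = mechanize.Browser()
    br.set_handle_robots(False)      # optional: skip robots.txt handling
    html = br.open(url).read()
    soup = BeautifulSoup(html)       # bundled copy; see note above on its API
    title_tag = soup.find('title')
    ScrapedPage(url=url, title=title_tag.string if title_tag else None).put()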