I am looking to set up an automated screen scraper that will run on Google App Engine using Python. I want it to scrape a site and put the specified results into an Entity in App Engine. I am looking for some direction on what to use. I have seen BeautifulSoup, but I wonder if people could recommend anything else that can run on Google App Engine.
The other choice is lxml, but it uses C code and so does not work on GAE.
BeautifulSoup runs fine on App Engine (just make sure to use 3.0.8, not the iffy 3.1.0). The main alternative, I think, would be html5lib -- I haven't tried it on App Engine, but I believe it does run there (quite slowly; if that's a problem, stick with BeautifulSoup). For example, this service runs on App Engine and is based on html5lib.
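If you do go the html5lib route, a minimal sketch might look like the following (the extract_headings helper and the h1 tags are just placeholders, not anything from the service mentioned above):

import html5lib

def extract_headings(html_text):
    # html5lib is pure Python, so it runs on App Engine (slowly).
    # Parse into a minidom tree and pull out the <h1> text nodes.
    doc = html5lib.parse(html_text, treebuilder="dom")
    return [h1.firstChild.nodeValue
            for h1 in doc.getElementsByTagName("h1")
            if h1.firstChild is not None]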
I have used BeautifulSoup with great success for parsing HTML. The problem is that parsing HTML is all BeautifulSoup does. I ended up writing all the HTTP interactions using urlfetch.
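For reference, a rough sketch of that split (the URL handling and the link extraction are made-up examples, not from my actual app):

from google.appengine.api import urlfetch
from BeautifulSoup import BeautifulSoup

def scrape_links(url):
    # urlfetch does the HTTP work; BeautifulSoup only parses the result.
    result = urlfetch.fetch(url)
    if result.status_code != 200:
        return []
    soup = BeautifulSoup(result.content)
    return [a['href'] for a in soup.findAll('a', href=True)]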
To web-scrape my target I need a full-fledged, code-driven browser that can execute JavaScript on my target site's pages. I think I'm having to dump the Python app and go Java so I can use HTMLUnit -- prototyping is underway. - mattb
I have had good (although slow) results using mechanize and BeautifulSoup. In fact, to save code space on Google App Engine, I use the (old) version of BeautifulSoup included in mechanize.
I have mechanize in a zip file, mechanize.zip. The index of this zip file looks like:
mechanize/
mechanize/__init__.py
mechanize/_auth.py
mechanize/_beautifulsoup.py
mechanize/_clientcookie.py
... etc
Then in my Python code,
import sys
# Put the zip on sys.path so zipimport can load the packaged mechanize.
sys.path.insert(0, 'mechanize.zip')
import mechanize
from mechanize._beautifulsoup import BeautifulSoup
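To round it out, here is a hypothetical end-to-end sketch: fetch a page with the zipimported mechanize, parse it, and save one field into a datastore Entity. The ScrapedPage model is made up, and find('title') assumes the standalone BeautifulSoup 3 API -- the copy bundled with mechanize is older, so adjust that call to its method names if necessary.

from google.appengine.ext import db

class ScrapedPage(db.Model):
    url = db.StringProperty()
    title = db.StringProperty()

def scrape_and_store(url):
    br = mechanize.Browser()
    br.set_handle_robots(False)      # optional: skip robots.txt handling
    html = br.open(url).read()
    soup = BeautifulSoup(html)       # bundled copy; see note above on its API
    title_tag = soup.find('title')
    ScrapedPage(url=url, title=title_tag.string if title_tag else None).put()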