I want to automate interaction with a webpage. I've been using pycurl up until now, but eventually the webpage will use JavaScript, so I'm looking for alternatives. A typical interaction is "open the page, search for some text, click on a link (which opens a form), fill out the form and submit".

We're deploying on Google App Engine, if that makes a difference.

Thanks.

Clarification: we're deploying the webpage on App Engine, but the interaction is run on a separate machine. So Selenium seems like the best choice.

+2  A: 

What about Selenium? (http://seleniumhq.org)

Neil Santos
A: 

Check out mechanize. It should be able to handle your "typical interaction" pretty easily. Another option might be Selenium, but I've never used it personally.

Will McCutchen
Does mechanize do JS?
Paul Biggar
@Paul Biggar: No.
nosklo
A: 

twill is very lightweight but works well.

John Paulett
Thanks John, twill would be great except that it doesn't seem to support javascript and that's the next step for my app.
nafe
+1  A: 

Twill and mechanize don't do Javascript, and Qt and Selenium can't run on App Engine ((1)), which only supports pure Python code. I do not know of any pure-Python Javascript interpreter, which is what you'd need to deploy a JS-supporting scraper on App Engine:-(.
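
To illustrate the limitation, here is a minimal sketch using only Python's standard library (not mechanize or twill themselves). The page content, the URLs, and the names LinkCollector and static_links are made up for the example. A tool that only parses the HTML the server sends never sees a link that a script would add in a real browser:

```python
from html.parser import HTMLParser

# Hypothetical page: one link is present in the static HTML, the other
# would only exist after a browser executed the script.
PAGE = """
<html><body>
<a href="/static-form">Open form</a>
<script>
  var a = document.createElement("a");
  a.href = "/js-form";
  a.textContent = "Open JS form";
  document.body.appendChild(a);
</script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag found in the raw markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def static_links(html):
    """Return the links visible without running any JavaScript."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# The script body is treated as opaque text, so "/js-form" is never found.
print(static_links(PAGE))  # ['/static-form']
```

Finding "/js-form" requires something that actually executes the script, which is why a browser-driving tool comes up as soon as JavaScript is involved.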

Maybe there's something in Java, which would at least allow you to deploy on (the Java version of) App Engine? App Engine app versions in Java and Python can use the same datastore, so you could keep some part of your app in Python... just not the part that needs to understand Javascript. Unfortunately I don't know enough about the Java / AE environment to suggest any specific package to try.

((1)): to clarify, since there seems to be a misunderstanding that has gone so far as to get me downvoted: if you run Selenium or other scrapers on a different computer, you can of course target a site deployed on App Engine. It doesn't matter how the website you're targeting is deployed, what programming language[s] it uses, etc., as long as it's a website you can access (a real website, that is: Flash and the like may well be different). How I read the question, the OP is looking for ways to have the scraping run as part of an App Engine app. That is the problematic part, not where you (or somebody else ;-) run the site being scraped!

Alex Martelli
Thanks Alex, that's useful. Would python spidermonkey do the trick? Otherwise, I guess I should start looking for Java libraries...
nafe
nafe, what are you deploying on App Engine? The page that contains the form, or are you actually trying to deploy the automation script to App Engine? If you are running the automation script outside of App Engine, Selenium would be the way to go. Python spidermonkey won't work on App Engine: there is a ton of C under the hood. If you are going with Java, look at HTMLUnit; it can handle some JavaScript.
John Paulett
Selenium does work on App Engine. Se:RC using Python can walk through the site quite easily. I test my App Engine app that way.
AutomatedTester
@AutomatedTester, you're not running Selenium on App Engine's production instance -- you're running it on another computer, and then you can of course "walk through sites" that you can reach, no matter how they're deployed.
Alex Martelli
Hi Alex, I'm sorry about the ambiguity in my question. I should have phrased it more carefully. I want to do the interaction from a different machine and not within app engine. Thanks very much for your help -- it's really appreciated.
nafe
A: 

Did you try using QtWebKit with PyQt? You can load a specific URL and read the content from Python. You could then search for URLs and use WebKit again to access them. I think all of that can also be done with some basic Django view testing (assuming you are using Django on GAE), which will test the response code. Here's some sample QtWebKit PyQt code to get you started if you want to do it the GUI way:

import sys

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebSettings, QWebView

# A QApplication must exist before any widget is created.
app = QApplication(sys.argv)

web = QWebView()

# Turn on plugins, Java applets, extra JavaScript permissions and the
# developer tools for this view.
settings = web.settings()
settings.setAttribute(QWebSettings.PluginsEnabled, True)
settings.setAttribute(QWebSettings.JavaEnabled, True)
settings.setAttribute(QWebSettings.JavascriptCanOpenWindows, True)
settings.setAttribute(QWebSettings.JavascriptCanAccessClipboard, True)
settings.setAttribute(QWebSettings.DeveloperExtrasEnabled, True)
settings.setAttribute(QWebSettings.ZoomTextOnly, True)

# Keep offline storage and the icon database in the current directory.
settings.setOfflineStoragePath('.')
settings.setIconDatabasePath('.')

url = 'http://stackoverflow.com'

# Load the page and show it in a window.
web.load(QUrl(url))
web.show()

# Run the Qt event loop until the window is closed.
sys.exit(app.exec_())
Thierry Lam
Qt doesn't run on App Engine (you can _target_ sites deployed on App Engine, of course, but you can't _run_ Qt as part of your GAE app).
Alex Martelli