I am trying to write a Python-based Web Bot that can read and interpret an HTML page, then execute an onClick function and receive the resulting new HTML page. I can already read the HTML page and I can determine the functions to be called by the onClick command, but I have no idea how to execute those functions or how to receive the resulting HTML code.

Any ideas?

+1  A: 

The only Python tool for JavaScript that I am aware of is python-spidermonkey. I have never used it, though.

With Jython you could (ab-)use HttpUnit.

Edit: forgot that you can use Scrapy. It supports Javascript through Spidermonkey, and you can even use Firefox for crawling the web.

stephan
A: 

Well, obviously Python won't interpret the JS for you (though there may be modules out there that can). I suppose you need to convert the JS instructions to equivalent transformations in Python.
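As a rough illustration of that "convert the JS to Python" idea, here is a minimal sketch that handles only one very common case: an onClick handler that simply assigns a URL to `window.location` or `location.href`. The function name `onclick_to_url` and the regular expression are my own assumptions for the example, and anything more complex than a plain redirect is reported as not understood.

```python
import re
from urllib.parse import urljoin

# Assumed pattern: handlers such as  window.location = 'page2.html'
# or  location.href = "page2.html". Anything else is not recognised.
_LOCATION_RE = re.compile(
    r"""(?:window\.|document\.)?location(?:\.href)?\s*=\s*(['"])(?P<url>.*?)\1"""
)


def onclick_to_url(handler, base_url):
    """Return the absolute URL a simple redirect handler would load.

    Returns None when the handler is not a plain location assignment.
    """
    match = _LOCATION_RE.search(handler)
    if match is None:
        return None
    # Resolve relative targets against the page the handler came from.
    return urljoin(base_url, match.group("url"))
```

The returned URL can then be fetched with whatever HTTP code you already use to read the first page. Handlers that call arbitrary functions still need a real JavaScript interpreter.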

I suppose ElementTree or BeautifulSoup would be good starting points to interpret the HTML structure.
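For the "interpret the HTML structure" step, a minimal sketch of collecting every onClick handler on a page might look like the following. It uses the standard library's `html.parser` only to stay self-contained; BeautifulSoup or ElementTree would do the same job. The class and function names are made up for the example.

```python
from html.parser import HTMLParser


class OnClickCollector(HTMLParser):
    """Collect (tag, handler source) pairs for elements with an onclick attribute."""

    def __init__(self):
        super().__init__()
        self.handlers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so "onClick"
        # in the page arrives here as "onclick".
        for name, value in attrs:
            if name == "onclick" and value:
                self.handlers.append((tag, value))


def find_onclick_handlers(html):
    """Return a list of (tag, handler source) tuples found in the HTML text."""
    collector = OnClickCollector()
    collector.feed(html)
    collector.close()
    return collector.handlers
```

For example, `find_onclick_handlers('<a href="#" onClick="loadPage(2)">next</a>')` yields the handler source `loadPage(2)` for the `a` tag, which is the string you would then have to translate or pass to a JavaScript engine.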

SpliFF
A: 

To execute JavaScript, you need to do much of what a full web browser does, except for the rendering. In particular, you need a JavaScript interpreter, in addition to the Python interpreter.

One starting point might be python-spidermonkey. Depending on the specific JavaScript, you might have to provide a good DOM API to the spidermonkey, in addition to providing an XmlHttpRequest implementation.

Martin v. Löwis
A: 

You can try to leverage V8:

V8 is Google's open source, high performance JavaScript engine. It is written in C++ and is used in Google Chrome, Google's open source browser.

Calling it from Python may not be straightforward without a framework to provide the DOM. Pyjamas has an experimental project, Pyjamas Desktop, providing V8 integration for JavaScript execution.

Pyv8 is an experimental set of Python bindings for V8, along with a Python-to-JavaScript compiler.

gimel
A: 

For the browser part of this you might want to look into Mechanize, which is basically a web browser implemented as a Python library (http://pypi.python.org/pypi/mechanize/0.1.11). But as mentioned, the text in onClick is JavaScript, and you'll need spidermonkey for that.

If you can add generic spidermonkey support to mechanize, I'm sure many people would be extremely happy. ;)

Mechanize may be overkill, though. If you just want to find specific parts of the HTML, lxml and BeautifulSoup both work well.

Lennart Regebro