Is there a Python module for rendering an HTML page with JavaScript and getting back a DOM object?
I want to parse a page which generates almost all of its content using JavaScript.
The only way I know to accomplish this would be to drive a real browser, for example using selenium-rc.
The big complication here is emulating the full browser environment outside of a browser. You can use standalone JavaScript interpreters like Rhino and SpiderMonkey to run JavaScript code, but they don't provide the complete browser-like environment needed to fully render a web page.
If I needed to solve a problem like this, I would first look at how the JavaScript is rendering the page. It's quite possible it's fetching data via AJAX and using that to render the page. I could then use Python libraries like simplejson and httplib2 to fetch the data directly and use that, negating the need to access the DOM object. However, that's only one possible situation; I don't know the exact problem you are solving.
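As a rough illustration of the "fetch the data directly" idea, here is a minimal sketch. It uses the standard library's json and urllib.request in place of simplejson and httplib2 (my substitution, so it runs without extra installs). The endpoint URL and the {"items": [{"title": ...}]} layout are made up for the example; you would find the real URL and structure by watching the page's requests in your browser's network tools.

```python
import json
import urllib.request


def extract_titles(payload):
    """Parse a JSON string and return the 'title' of each entry.

    The {"items": [{"title": ...}]} layout is hypothetical; adapt it to
    whatever the page's script actually receives.
    """
    data = json.loads(payload)
    return [item["title"] for item in data.get("items", [])]


def fetch_titles(url):
    """Download the JSON that the page's script would request, then parse it."""
    with urllib.request.urlopen(url, timeout=10) as response:
        payload = response.read().decode("utf-8")
    return extract_titles(payload)


# Example call (hypothetical endpoint, not a real API):
# fetch_titles("http://example.com/api/items.json")
```

Keeping the parsing separate from the download makes it easy to try the parsing on a saved response before pointing it at the live site.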
Other options include the selenium one mentioned by Łukasz, some kind of webkit embedded craziness, some kind of IE win32 scripting craziness or, finally, a pyxpcom based solution (with added craziness). All these have the drawback of requiring pretty much a fully running web browser for python to play with, which might not be an option depending on your environment.
QtWebKit is contained in PyQt4, but I don't know if you can use it without showing a widget. After a cursory look over the documentation, it seems to me you can only get HTML, not a DOM tree.
You can probably use python-webkit for it. Requires a running glib and GTK, but that's probably less problematic than wrapping the parts of webkit without glib.
I don't know if it does everything you need, but I guess you should give it a try.
When I need to do this kind of thing, I have IE do all the heavy lifting. There are a couple of automated testing frameworks out there that can give you access to the DOM of IE. Pamie (http://sourceforge.net/projects/pamie) is probably the best documented. You will need to download the win32 extensions for it to work. There are some other options, but they have next to no documentation. ishy_broser.py is one of them. Go here (http://www.ishpeck.net/index.php?P=b1115239318ishpeck) and here (http://www.ishpeck.net/index.php?P=b1115225809ishpeck) to get the script and what little documentation there is. Lastly, there is IEC.py (http://www.mayukhbose.com/python/IEC/index.php). I can't say I have ever used this one, but it is the simplest one out there.
A combination of a text-based browser (e.g., links2) and Python's lxml (or the standard library's ElementTree) and subprocess modules will suffice for simple cases.
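A minimal sketch of that combination, assuming the browser can print the page markup to stdout. The function names are mine, and the links2 command line in the comment is an assumption, so check your browser's man page for the right flags. ElementTree only accepts well-formed (X)HTML; lxml.html is more forgiving with real-world pages.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_browser(command):
    """Run a command-line browser and return whatever it prints to stdout."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def paragraph_texts(markup):
    """Return the text of every <p> element in well-formed (X)HTML.

    Note: markup that declares an XML namespace would need the tag written
    as "{namespace}p" instead of plain "p".
    """
    root = ET.fromstring(markup)
    return ["".join(p.itertext()).strip() for p in root.iter("p")]


# Example call (the "-source" flag is an assumption, verify it locally):
# markup = run_browser(["links2", "-source", "http://example.com/"])
# print(paragraph_texts(markup))
```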
The PyXPCOM Firefox extension allows Python to be used inside the browser.