Here LINK it is suggested that it is possible to "Figure out what the JavaScript is doing and emulate it in your Python code". This is what I would like help doing, i.e. my question: how do I emulate javascript:__doPostBack?

Code from a website (full page source here LINK):

<a style="color: Black;" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSearchResults','Page$2')">2</a>

Of course, I basically have no idea where to go from here.

Thanks in advance for your help and ideas.

OK, there are lots of posts asking how to CLICK a JavaScript button when web scraping with Python libraries like mechanize, BeautifulSoup and similar. I see a lot of "that is not supported" responses, followed by "use THIS non-Python solution". I think a Python solution to this problem would be of great benefit to many. In that light, I am not looking for answers like "use x, y or z", which are not Python code or which require interacting with a browser.

+1  A: 

The mechanize page is not suggesting that you can emulate JavaScript in Python. It is saying that you can change a hidden field in a form, thus tricking the web server into believing that a human [1] has selected the field. You still need to analyse the target yourself.
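For what it's worth, here is a minimal sketch of that hidden-field approach with mechanize for the ASP.NET page in the question. __doPostBack('target', 'arg') just fills two standard hidden inputs, __EVENTTARGET and __EVENTARGUMENT, and submits the form, so you can do the same thing server-side. The URL below is a placeholder, and it assumes the page's form already contains those hidden fields (ASP.NET pages normally do):

    import mechanize

    br = mechanize.Browser()
    br.open("http://example.com/search.aspx")  # placeholder URL
    br.select_form(nr=0)                 # ASP.NET pages have a single server form
    br.form.set_all_readonly(False)      # hidden controls are read-only by default
    br["__EVENTTARGET"] = "ctl00$ContentPlaceHolder1$gvSearchResults"
    br["__EVENTARGUMENT"] = "Page$2"
    # __VIEWSTATE (and __EVENTVALIDATION, if present) are already in the form
    # and get re-submitted automatically.
    response = br.submit()
    page_two = response.read()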

There will be no Python-based solution to this problem, unless you wish to create a JavaScript interpreter in Python.

My thoughts on this problem have led me to three possible solutions:

  1. create an XULRunner application
  2. browser automation
  3. attempt to interpret the client-side code

Of those three, I've only really seen discussion of 2. I've seen something close to 1 in a commercial scraping application, where you basically create scripts by browsing on sites and selecting things on the pages that you would like the script to extract in the future.

1 could possibly be made to work with a Python script by accepting a serialisation (JSON?) of WSGI Request objects, getting the app to fetch the URL, then sending the processed page back as a WSGI Response object. You could possibly wrap some middleware around urllib2 to achieve this. Overkill probably, but kind of fun to think about. A rough sketch of the hand-off follows.
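Purely for illustration, the exchange might look like this; the helper name xul-fetch and the JSON field names are invented, and the XULRunner side is left entirely to the imagination:

    import json
    import subprocess

    def fetch_rendered(url, headers=None):
        request = json.dumps({"url": url, "headers": headers or {}})
        # The hypothetical XULRunner app reads the request on stdin, loads the
        # page in Gecko (running its JavaScript), and prints the final DOM.
        proc = subprocess.Popen(["xul-fetch"], stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE)
        out, _ = proc.communicate(request)
        return json.loads(out)  # e.g. {"status": 200, "body": "<html>..."}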

2 is usually achieved via Selenium RC (Remote Control), a testing-centric tool. It provides a few methods like getHtmlSource, but most people I've heard from who use it don't like its API.
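For example, clicking the paging link from the question would look roughly like this with the Selenium RC Python client, assuming a Selenium server already running on localhost:4444 and a placeholder site URL and path:

    from selenium import selenium

    sel = selenium("localhost", 4444, "*firefox", "http://example.com/")
    sel.start()
    sel.open("/search-results")          # placeholder path
    sel.click("link=2")                  # click the "2" paging link by its text
    sel.wait_for_page_to_load("30000")   # timeout in milliseconds
    html = sel.get_html_source()         # the post-JavaScript page source
    sel.stop()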

3 I have no idea about. node.js is very hot right now, but I haven't touched it. I've never been able to build spidermonkey on my Ubuntu machine, so I haven't touched that either. My hunch is that in order to do this, you would provide the HTML source and your details to a JS interpreter that would need to fake being your User-Agent, etc., in case the JavaScript wanted to reconnect with the server.
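For completeness, if you do manage to build it, the python-spidermonkey bindings expose roughly this kind of API for evaluating JavaScript from Python. A toy sketch only, and the snippet evaluated is illustrative, not the page's actual code:

    import spidermonkey

    rt = spidermonkey.Runtime()
    cx = rt.new_context()
    result = cx.execute("var a = 'Page$'; a + 2;")  # -> 'Page$2'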

[1] Well, more technically, a JavaScript-compliant User-Agent, which is almost always a web browser used by a human.

Tim McNamara
Thanks for a great explanation of why I can't do it :-) At least now I have a better understanding of the problem. Thanks again.
Vincent
This seems like a solution. http://code.google.com/p/spynner/
Vincent
Sweet! Nice find, thanks!
Tim McNamara
A: 

The best method is to use a web browser instead. We use iMacros for Firefox for web scraping with very good success. It can also be driven from Python (we use it from C#).

The drawback of using a web browser is that you do not get the same performance as with a headless tool like mechanize. But the huge advantage is that it works with any website.
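As a sketch of what driving it from Python might look like: iMacros exposes a COM scripting interface on Windows (Scripting Edition), which pywin32 can talk to. The ProgID and the macro name below are assumptions for illustration:

    import win32com.client

    iim = win32com.client.Dispatch("imacros")
    iim.iimInit("")                  # start the browser
    iim.iimPlay("MyScrapingMacro")   # run a recorded macro that clicks the link
    html = iim.iimGetLastExtract()   # grab whatever the macro extracted
    iim.iimExit()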

SamMeiers
Well, iMacros kinda led me to find http://juicedpyshell.googlecode.com/svn/trunk/doc/html/index.html which looks interesting.
Vincent
Interesting tool. That said, I find `mechanize` very slow for our needs; I use `scrapy`.
Tim McNamara