Possible Duplicate:
Screen Scraping from a web page with a lot of Javascript

I want to do tasks such as form entry and web scraping, but on sites that require JavaScript support, and I need to fill in forms, scrape, and so on all within the same session. Ideally, I'd like a way to control a web browser from the command line. I'm only using Linux for all of this, so I can't use .NET.

I found the webbrowser library for Python, but its capabilities look very limited. If it could interface with mechanize and BeautifulSoup, that would be amazing.
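To show what I mean, here is roughly the mechanize + BeautifulSoup pattern I have in mind - it keeps cookies across form submission and scraping, so everything happens in one session, but it never executes any JavaScript (the URL and form field names below are made up):

    import mechanize
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; in bs4 it's `from bs4 import BeautifulSoup`

    br = mechanize.Browser()
    br.set_handle_robots(False)  # only do this if the site's robots policy permits it

    # Log in: mechanize keeps the cookies, so later requests stay in the same session.
    br.open("http://example.com/login")  # hypothetical URL
    br.select_form(nr=0)                 # pick the first form on the page
    br["username"] = "me"                # hypothetical field names
    br["password"] = "secret"
    br.submit()

    # Scrape another page within the same logged-in session.
    html = br.open("http://example.com/data").read()
    soup = BeautifulSoup(html)
    print soup.title.string             # no JavaScript is ever executed

As far as I can tell, anything a page builds with JavaScript is simply invisible to this approach.

Any suggestions? Thanks!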

+1  A: 

You could certainly write a XUL application with Mozilla (run it with Firefox, XULRunner, etc.) that scripts a web browser. JavaScript is normally used for such tasks.
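This also gives you the command-line control you asked for: once your XUL app has an application.ini, you launch it with XULRunner directly, or via Firefox's -app flag. Since you're in Python anyway, a driver could be as simple as the sketch below (the application path is made up):

    import subprocess

    # Launch a XULRunner application from the command line; the path is hypothetical.
    subprocess.call(["xulrunner", "/home/me/scraper/application.ini"])

    # Or, using an installed Firefox:
    # subprocess.call(["firefox", "-app", "/home/me/scraper/application.ini"])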

What I've found tricky is suppressing all the kinds of dialogue boxes the browser would otherwise create - you effectively have to override the behaviour of the XPCOM server classes invoked for each type of dialogue, and there are a lot of different ones (for example, if your site decides to redirect to an https site with an expired certificate).

Of course you should NOT use such a mechanism to violate any site's policy on use by robots. Normally you should never submit a form with a robot.

MarkR
Never knew about XUL before. Thanks, I'll look into it.
Lin