views:

247

answers:

6

Are there any libraries or frameworks that provide the functionality of a browser, but do not need to actually render physically onto the screen?

I want to automate navigation on web pages (Mechanize does this, for example), but I want the full browser experience, including Javascript. Thus, I'd like to have a virtual browser of some sort, that I can use to "click on links" programmatically, have DOM elements and JS scripts render within it, and manipulate these elements.

Solution preferably in Python, but I can manage others.

A: 

Looks like http://watin.sourceforge.net/ might be a good way to go.

If you don't have to go pure Python, you could do IronPython since it's a C# project.

MStodd
A: 

take a look at this little doosy on ajaxian

http://ajaxian.com/archives/server-side-rendering-with-yui-on-node-js

It also talks about Aptana Jaxer which I think runs on a headless firefox so is basically the Mozilla browser engine in all it's glory.

Matt Smith
A: 

There is Kapow. Its pure Java and costs money:

http://kapowtech.com/

And there is Lixto: Its Eclipse based and uses Mozilla Gecko as rendering engine (unless they already changed it to WebKit, as they said they'll do years ago). Its very nice and also costs money:

http://www.lixto.com/?page_id=50

They are both graphical tools where you define the site navigation and what should be extracted by point and click. But you can also write xpath and regular expressions and even JavaScript that runs in the sites context.

I used them both in the lectures web data extraction and applied web data extraction at the technical university Vienna (Lixto is written by the Professor who held the lecture).

panzi
+1  A: 

HTMLUnit in Java is very good. I think it's only the Java implementations of headless browsers that manage to provide Javascript support.

MaxQ, I read about here, sounds like it might be interesting: "written in Java, generates Jython scripts"

i5m
A: 

There is an absolute ton of free software technology now available: take your pick at http://wiki.python.org/moin/WebBrowserProgramming but if you have specific questions join pyjamas-dev on google groups and i'll be happy to give further details, there. brief answer: you can run pywebkitgtk "headless", or you can use xulrunner (via python-hulahop) again using pygtk without actually doing "browserwidget.show()", and there's also pykhtml. also you could use python COM to connect to MSHTML.DLL.

these are all "cheat" methods: using python bindings to a graphical web browser engine without actually firing up the graphical bit. if you really wanted to put some serious hard-core programming in, you could create a "port" of webkit which was not connected to a GUI toolkit: as an experienced webkit programmer i'd put it as around... 2 weeks of full-time effort to make such a "headless" version of webkit.

l.

A: 

Try HtmlUnit !!!

Stefan