I found a project, Jaxer, which embeds Firefox's JavaScript engine on the server side, so it can parse HTML server-side very well. But the project seems dead. A tool like that would be really helpful for crawling web pages to parse HTML and extract data.

Is there any newer technology that is useful for extracting information this way?

A: 

What I've done in the past is use Selenium RC to control a real web browser (usually Firefox) from code, to load and parse websites.

The cool thing about this is that you're mostly coding in a language you're comfortable with, be it Perl, Ruby, or C#. But to fully use the power of Selenium you still need to know and write JavaScript.
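
For example, here's a minimal sketch using the legacy Selenium RC Python client. The host, port, and target URL are placeholders, and it assumes an RC server is already running:

    # Sketch only: assumes the legacy Selenium RC server is listening on
    # localhost:4444 and the old "selenium" RC Python client is installed.
    from selenium import selenium

    browser = selenium("localhost", 4444, "*firefox", "http://example.com/")
    browser.start()                             # launches a real Firefox
    try:
        browser.open("/")                       # load the page in the browser
        browser.wait_for_page_to_load("30000")  # let its JavaScript settle
        html = browser.get_html_source()        # rendered DOM, not the raw HTTP body
        # Anything fancier means dropping down to JavaScript in the browser:
        print(browser.get_eval("window.document.title"))
    finally:
        browser.stop()

The point is that get_html_source() hands you the DOM after the page's scripts have run, which is exactly what a plain HTTP fetch can't give you.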

slebetman
A: 

Another interesting way to do this is to use node.js in conjunction with jsdom and node-htmlparser to load a page and run the JavaScript in it. It doesn't really work out of the box yet, but Dav Glass (from Yahoo) has had success running YUI in node.js using a modified version of this combo.

This is interesting if you decide that nothing out there is good enough and you want to implement your own. If so, it would make an excellent open-source project.

slebetman
Note: I'm adding this as a separate answer because it is a radically different solution from my previous answer.
slebetman
A: 

I've had some success writing a JS-enabled crawler in Python + pywebkitgtk + JavaScript. It's much slower than a traditional crawler, but it gets the job done and can do cool stuff like take screenshots and pick up content that's been 'obscured' by JS injection.

There's a decent article with some example code here:

http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/
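
For flavor, here's a rough sketch of the technique that article describes, assuming Python 2 with pywebkitgtk installed (the URL is a placeholder). Since execute_script() doesn't return a value, the rendered DOM is smuggled out through the page title:

    # Sketch only: assumes Python 2 with pywebkitgtk (imported as "webkit").
    import gtk
    import webkit

    view = webkit.WebView()
    window = gtk.Window()
    window.add(view)
    window.show_all()

    def on_load_finished(view, frame):
        # The page and its JavaScript have finished loading; copy the rendered
        # DOM into document.title so Python can read it via get_title().
        view.execute_script(
            "document.title = document.documentElement.innerHTML;")
        print(frame.get_title())
        gtk.main_quit()

    view.connect("load-finished", on_load_finished)
    view.open("http://example.com/")
    gtk.main()

Because WebKit actually executes the page's scripts before load-finished fires, the HTML you print here includes any JS-injected content a plain fetch would miss.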

no