As mentioned in a previous question, I'm coding a crawler for the QuakeLive website.
I've been using WWW::Mechanize to get the web content, and this worked fine for all pages except the one with matches. The problem is that I need to get all IDs of this kind:

<div id="ffa_c14065c8-d433-11df-a920-001a6433f796_50498929" class="areaMapC">

These are used to build the URLs of specific matches, but I simply can't retrieve them.

I managed to see those IDs only via FireBug; no page downloader, parser, or getter I tried was able to help here. All I can get is a simpler version of the page, whose markup is what you see with "View Source" in Firefox.

Since FireBug shows the IDs, I can safely assume they are loaded at some point, but then I can't understand why nothing else gets them. It probably has something to do with JavaScript.

You can find a page example HERE

A: 

Read the FAQ: WWW::Mechanize doesn't do JavaScript. The site is probably using JavaScript to change the page after it loads. You'll need a different approach.

Turtle
I tried more approaches ("no page downloader, parser, getter I tried was able to help here"), but still no result.
Gurzo
+4  A: 

The problem is that Mechanize mimics the networking layer of the browser, but not the rendering or JavaScript execution layers.

Many folks use the web browser control provided by Microsoft. This is a full instance of IE in a control that you can host in a WinForm, WPF, or plain old console app. It allows you to, among other things, load the web page and run JavaScript, as well as send and receive JavaScript commands.

Here's a reasonable intro into how to host a browser control: http://www.switchonthecode.com/tutorials/csharp-snippet-tutorial-the-web-browser-control

will
Is it possible to implement that in Perl? Otherwise it's useless to me as I have to do this in Perl.
Gurzo
+1  A: 

A ton of the data is sent over AJAX requests. You need to account for that in your crawler somehow.
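One way to account for it (a sketch, assuming you've identified the XHR endpoint in FireBug's Net panel and that it returns JSON — the URL below is hypothetical) is to request the AJAX endpoint directly with WWW::Mechanize and decode the response:

```perl
use strict;
use warnings;
use WWW::Mechanize;
use JSON;           # provides decode_json
use Data::Dumper;

my $mech = WWW::Mechanize->new;

# Hypothetical endpoint -- substitute the real XHR URL you see in FireBug
$mech->get('http://www.quakelive.com/some_ajax_endpoint');

# Many AJAX endpoints return JSON rather than HTML
my $data = decode_json( $mech->content );

# Inspect the structure to find where the match IDs live
print Dumper($data);
```

This skips the browser entirely, so it only works if the IDs are present in the raw AJAX response rather than computed by client-side code.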

+1  A: 

It looks like they are using AJAX; I can see where the requests are being sent using FireBug. You may need to pick up on this by parsing and executing the JavaScript that affects the DOM, or by replaying those requests yourself.

Adam Smith
+7  A: 

To get at the DOM containing those IDs you'll probably have to execute the JavaScript code on that site. I'm not aware of any libraries that would let you do that and then introspect the resulting DOM from within Perl, so controlling an actual browser and then asking it for the DOM, or only the parts of it you need, seems like a good way to go about this.

Various browsers provide ways to be controlled programmatically. With a Mozilla-based browser, such as Firefox, this could be as easy as loading mozrepl into the browser, opening a socket from Perl, sending a few lines of JavaScript over to actually load that page, and then some more JavaScript to hand back the parts of the DOM you're interested in. The result you could then parse with one of the many JSON modules on CPAN.

Alternatively, you could work through the JavaScript code executed on your page, figure out what it actually does, and then mimic that in your crawler.

rafl
[`WWW::Mechanize::Firefox`](http://p3rl.org/WWW::Mechanize::Firefox) wraps mozrepl for you; no need to do it the hard way.
daxim
This seems the best idea so far (paired with the Mechanize::Firefox comment). I'm currently trying it, if it works I'm definitely accepting this as answer. Thanks
Gurzo
It worked! I just had to make it wait for a particular class to appear and then get my data. Thanks again to both of you (I used Mechanize::Firefox)!
Gurzo
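A sketch of the WWW::Mechanize::Firefox approach that worked here (it assumes Firefox is running with the mozrepl extension loaded; the URL is hypothetical, and the polling loop that waits for the class from the question is illustrative):

```perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;   # requires a running Firefox with mozrepl

my $mech = WWW::Mechanize::Firefox->new;

# Hypothetical URL for the matches page
$mech->get('http://www.quakelive.com/#!matches');

# The IDs arrive via AJAX after page load, so poll until the
# divs with the class from the question appear (illustrative loop)
my @divs;
for ( 1 .. 30 ) {
    @divs = $mech->selector('div.areaMapC');
    last if @divs;
    sleep 1;
}

# Each result proxies a live DOM node, so read its id attribute
print $_->{id}, "\n" for @divs;
```

Because the module talks to a real Firefox instance, the page's JavaScript runs exactly as it does for a human visitor; the trade-off is that the crawler now needs a browser available.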
+1  A: 

You should be able to use WWW::HtmlUnit — it loads and executes JavaScript.

Clay Hinson
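A minimal sketch of the WWW::HtmlUnit route (it drives the Java HtmlUnit headless browser via Inline::Java, so a JVM is required; the URL is hypothetical, and the fixed `sleep` for background JavaScript is a crude placeholder you'd tune):

```perl
use strict;
use warnings;
use WWW::HtmlUnit;   # wraps the Java HtmlUnit browser via Inline::Java

my $webclient = WWW::HtmlUnit->new;

# Hypothetical URL for the matches page
my $page = $webclient->getPage('http://www.quakelive.com/#!matches');

# Give background JavaScript a moment to populate the DOM
sleep 5;

# Serialize the rendered DOM and pull out the ffa_... ids
my $xml = $page->asXml;
print "$1\n" while $xml =~ /id="(ffa_[^"]+)"/g;
```

Unlike the Firefox approach, this is fully headless, which makes it easier to run on a server.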