views: 317
answers: 6

There is a website I am trying to pull information from in Perl; however, the section of the page I need is generated by JavaScript, so all you see in the source is:

<div id="results"></div>

I need to somehow pull out the contents of that div and save it to a file using Perl/proxies/whatever. For example, the information I want to save would be the value of:

document.getElementById('results').innerHTML;

I am not sure if this is possible, or if anyone has any ideas or a way to do this. I was using a lynx source dump for other pages, but since I can't screen-scrape this page in a straightforward way, I came here to ask about it!

If anyone is interested, the page is http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU and the info I am trying to get is the row about the ConsumerOPR.

+7  A: 

Bringing the Browser to the Server by John Resig might be useful.

farinspace
Very interesting link. For many years, I had to do this stuff the hard way. I am actually a little disappointed that it will be orders of magnitude easier now.
Sinan Ünür
lol... as sites add more and more dynamic components, I welcome the ease of use... I remember having to set up a dedicated server in order to run a browser for similar purposes.
farinspace
@farinspace Single dedicated server? How about four dedicated quad-CPU systems running 64 instances of IE simultaneously? ;-)
Sinan Ünür
+9  A: 

You'll need to reverse-engineer what the Javascript is doing. Does it fire off an AJAX request to populate the <div>? If so, it should be pretty easy to sniff the request using Firebug and then duplicate it with LWP::UserAgent or WWW::Mechanize to get the information.

If the Javascript is just doing pure DOM manipulation, then that means the data must exist somewhere else in the page or the Javascript already. So figure out where it's coming from and grab it.

Finally, if none of those options are adequate, you may need to just use a real browser to do it. There are a few options for automating browser behavior, like WWW::Mechanize::Firefox or Win32::IE::Mechanize.
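
If you go the real-browser route, a minimal sketch with WWW::Mechanize::Firefox might look like this (an untested sketch; it assumes Firefox is running with the MozRepl extension and uses the URL and div id from the question):

use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Talks to an already-running Firefox through the MozRepl extension
my $mech = WWW::Mechanize::Firefox->new();

$mech->get('http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU');

# content() returns the DOM as Firefox sees it, i.e. after the JavaScript has run,
# so <div id="results"> is populated and can be saved or parsed like any other HTML
open my $fh, '>', 'results.html' or die "Can't write results.html: $!";
print {$fh} $mech->content;
close $fh;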

friedo
+4  A: 

As the content of your page is generated by some JavaScript, you need the ability to:

  • Execute some JavaScript code
    • Possibly even some complex JS code that does Ajax requests and all that
  • And do it with an engine that supports the functions/methods that are present in a browser (like DOM manipulation)


A solution could be to actually start a browser, navigate to that page, and then parse the page it loads to extract the information.

I've never used it for scraping, but the Selenium suite might help here: using Selenium RC, you can start a real browser and drive it, and you then have functions to get data from it.

It's not fast, and it's pretty heavy (it has to start a browser!), but it works quite well: you'll be using Firefox, for example, to navigate to your page, which means a real JavaScript engine that's used every day by a lot of people ;-)
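
Since the question is about Perl, a rough sketch of that approach with the WWW::Selenium client talking to a Selenium RC server could look like the following (untested; the host, port, and fixed sleep are assumptions):

use strict;
use warnings;
use WWW::Selenium;

# Assumes a Selenium RC server is already running on localhost:4444
my $sel = WWW::Selenium->new(
    host        => 'localhost',
    port        => 4444,
    browser     => '*firefox',
    browser_url => 'http://downloadcenter.trendmicro.com/',
);

$sel->start;
$sel->open('/index.php?clk=left_nav&clkval=pattern_file&regs=NABU');

# Crude pause to let the page's JavaScript finish populating the div
sleep 5;

# Text of the generated div, plus the full rendered source if you prefer to parse it yourself
my $results = $sel->get_text('id=results');
my $source  = $sel->get_html_source;

$sel->stop;

print $results;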

Pascal MARTIN
A: 

This might be what you're looking for (in PHP):

// The AJAX endpoint the page's JavaScript posts to in order to fill in the div
$url = 'http://downloadcenter.trendmicro.com/ajx/pattern_result.php';

$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, 'q=patresult_page&reg=NABU'); // same POST fields the page itself sends
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);                       // return the response as a string
$content = curl_exec($ch);
curl_close($ch);

echo $content;
exit;
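
A rough Perl equivalent of the same request, using LWP::UserAgent (an untested sketch; the endpoint and POST fields are copied straight from the PHP above):

use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $url = 'http://downloadcenter.trendmicro.com/ajx/pattern_result.php';

# Same POST fields the page's own JavaScript sends
my $res = $ua->post($url, { q => 'patresult_page', reg => 'NABU' });

die $res->status_line unless $res->is_success;

# decoded_content holds the HTML fragment that normally ends up in <div id="results">
open my $fh, '>', 'results.html' or die "Can't write results.html: $!";
print {$fh} $res->decoded_content;
close $fh;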

Once you get the content you can use something like http://code.google.com/p/phpquery/ to parse out the results you need, or a similar Perl equivalent.

And/or do the parsing yourself.

FYI: all I did was use Firebug to inspect the request and recreate it with PHP/cURL...

farinspace
A: 

Maybe you could use Greasemonkey to get the generated contents out of the browser itself.

Kinopiko
A: 

To work with the dynamically created HTML you can use the Firefox Chickenfoot plugin. Or, if you need something that works from a command-line script, use bindings to Perl. I have done this with Python before.

Plumo