I need to write a Perl script to scrape a website. The site only reveals its content via JavaScript, and the user is on Windows.

I got some way with Win32::IE::Mechanize on my work machine, which has IE6, but then I moved to my netbook, which has IE8, and now I can't even fetch a simple page.

Is Win32::IE::Mechanize up to date with the latest versions of IE?

But, more to the point, given a recent WinXP machine, what's the quickest, easiest way to scrape a site which only reveals its content via JavaScript?

A: 

Have a look at Win32::Watir. It's a newer module and explicitly supports IE 6, 7 and 8.

rjh
It looks great, but I can't even get it to run. It fails on new() with these errors: "Odd number of elements in hash assignment at C:\Perl\site\lib\Win32\Watir.pm line 101" and "Can't locate object method "_startIE" via package "visble" at C:\Perl\site\lib\Win32\Watir.pm line 108". Any advice?
AmbroseChapel
Oh wait, it's the documentation -- it says Watir::new when it should say Watir->new -- it's working now. Though that didn't fill me with confidence...
AmbroseChapel
I hope you submitted a patch for the doc bug you found. http://rt.cpan.org :)
brian d foy
I didn't, but I did submit a bug report. I should go back and do the documentation thing, you're right.
AmbroseChapel
A: 

I don't see any mention of WWW::Mechanize, so I'll bring it up just for completeness. Selenium is also becoming very popular and can be used in a lot of testing scenarios.

Ether
WWW::Mechanize doesn't do JavaScript, that's really why I'm here asking this question.
AmbroseChapel
@AmbroseChapel: `WWW::Mechanize::Firefox` does support JavaScript.
Zaid
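
To illustrate the `WWW::Mechanize::Firefox` suggestion above, here is a minimal sketch. It assumes Firefox is already running with the MozRepl extension enabled, which the module relies on, and the URL is only a placeholder for the real site.

```perl
use strict;
use warnings;
use WWW::Mechanize::Firefox;

# Assumption: Firefox is running with the MozRepl extension enabled.
# The URL below is a placeholder, not the site from the question.
my $mech = WWW::Mechanize::Firefox->new();

# Load the page in the live browser, so its JavaScript runs as usual.
$mech->get('http://www.example.com/');

# content() returns the page as the browser currently sees it,
# including anything the page's scripts have added.
print $mech->content;
```
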
A: 

WWW::Selenium.

  • It allows you to specify which browser to use (IE and Firefox are supported from the get-go)
  • It supports access to elements via XPath expressions, table IDs, text (regex matching!) and URLs
  • It provides a Swiss army knife of user-interaction options, giving you flexibility over how you wish to simulate end-user browsing

You'll need to download Selenium Remote Control (RC) and have it running in the background for the module to work.

It may not be a good option if your page load times are unpredictable.
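
As a rough sketch of what this looks like in practice, the script below drives IE through a Selenium RC server and polls for an element instead of relying on a single fixed wait, which is one way to cope with uneven load times. It assumes the RC server is already running on localhost port 4444; the URL and the `id=content` locator are hypothetical placeholders for the real site.

```perl
use strict;
use warnings;
use WWW::Selenium;

# Assumptions: Selenium RC is already running on localhost:4444, and
# both the URL and the 'id=content' locator are placeholders.
my $sel = WWW::Selenium->new(
    host        => 'localhost',
    port        => 4444,
    browser     => '*iexplore',                 # or '*firefox'
    browser_url => 'http://www.example.com/',
);

$sel->start;
$sel->open('/');

# Poll for up to 30 seconds until the JavaScript-rendered element appears.
my $found = 0;
for my $attempt (1 .. 30) {
    if ( $sel->is_element_present('id=content') ) {
        $found = 1;
        last;
    }
    sleep 1;
}

if ($found) {
    print $sel->get_text('id=content'), "\n";
}
else {
    warn "Element did not appear within 30 seconds\n";
}

$sel->stop;
```

The polling loop trades a little speed for robustness: the script moves on as soon as the content is there and only gives up after the full timeout.
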

Zaid
That certainly looks good but the installation of the RC part isn't going to be any fun for my geographically remote, somewhat clueless clients...
AmbroseChapel
@AmbroseChapel: It's not so much an installation as it is a download. Once the file is in place, run it via `java -jar selenium-server.jar` in the background.
Zaid
A: 

WWW::Scripter and its JavaScript plugin, WWW::Scripter::Plugin::JavaScript, can probably help you.

sreservoir