views:

270

answers:

3

I want to extract all the links from a page. I am using HTML::LinkExtor. How do I extract only the links that point to HTML content pages?

I also cannot extract links like this one:

javascript:openpopup('http://www.admissions.college.harvard.edu/financial_aid/index.html')

EDIT: By HTML pages I mean text/html content. I am not indexing pictures, etc.
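To illustrate, here is the kind of approach I have in mind (an untested sketch; the start URL is a placeholder): collect the <a href> links with HTML::LinkExtor, then keep only those whose Content-Type comes back as text/html from a HEAD request.

use strict;
use warnings;
use HTML::LinkExtor;
use LWP::UserAgent;
use URI;

my $ua   = LWP::UserAgent->new;
my $resp = $ua->get('http://example.com/');   # placeholder start page
die $resp->status_line unless $resp->is_success;

# Collect href attributes from <a> tags only
my @links;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @links, $attr{href} if $tag eq 'a' && $attr{href};
});
$extor->parse($resp->decoded_content);

# Keep only links that serve text/html, checked with a cheap HEAD request
for my $link (@links) {
    my $abs  = URI->new_abs($link, $resp->base);   # resolve relative links
    my $head = $ua->head($abs);
    print "$abs\n" if $head->is_success && $head->content_type eq 'text/html';
}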

+2  A: 

Yes, HTML::LinkExtor does not understand JavaScript. In fact, it's pretty unlikely that you'll find anything that recognizes URLs embedded in JavaScript, simply because doing that in general would require actually running the code.

Randal Schwartz
+1  A: 

Perl gives you a lot of ways to do this by brute force. You could use a push or pull parser to jump between tags (see the sketch below). You might also just slurp the entire page and regex through it for links, or for links inside JavaScript.
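For what it's worth, a minimal pull-parser sketch along those lines, using HTML::TokeParser (reading the page from STDIN is just for illustration):

use strict;
use warnings;
use HTML::TokeParser;

my $html = do { local $/; <STDIN> };          # slurp the whole page
my $p = HTML::TokeParser->new(\$html) or die "Can't parse HTML";

# get_tag skips ahead to the next <a>; attributes live in $tag->[1]
while (my $tag = $p->get_tag('a')) {
    my $href = $tag->[1]{href};
    print "$href\n" if defined $href;
}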

Have you looked at WWW::Mechanize::Plugin::JavaScript? The WWW::Mechanize module is a web bot's best friend (not that you are trying to bot). I've used this module before and can say it's one of the best Perl modules on CPAN.

Here is an example from its CPAN documentation, which sets the named variable to the given value:

$m->plugin('JavaScript')->set(
      'document', 'location', 'href' => 'http://www.perl.org/');
JulianK
It is a great module, and its FAQ is very funny, particularly because so many people ask for JavaScript support: http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize/FAQ.pod
AmbroseChapel
A: 

I'd use WWW::Mechanize for most link gathering. Other than that I'd do my own matching:

# Match the URL inside javascript:openpopup('...') calls
my @links = $content =~ m{javascript:openpopup\('([^']+)'}g;
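For completeness, a sketch of combining the two ideas (find_all_links is standard WWW::Mechanize; the URL is a placeholder, and the openpopup pattern matches the example from the question):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get('http://example.com/');                       # placeholder URL

my @urls = map { $_->url_abs } $mech->find_all_links();  # ordinary links
push @urls, $mech->content =~ m{javascript:openpopup\('([^']+)'}g;
print "$_\n" for @urls;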
chris d