Is there a module out there that can give me links to all the pages a website has?

Why I need it: I want to crawl some sites and search for tags in them; searching only the main page is not enough.

Thanks,

+2  A: 

You may find HTML::LinkExtor of use.
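
A minimal sketch of how that might look, assuming the page is fetched with LWP::Simple (the URL is only a placeholder):

use HTML::LinkExtor;
use LWP::Simple;

my $url  = 'http://www.example.com/';
my $html = get($url) or die "Couldn't fetch $url";

# with no callback, parsed links are collected and returned by links();
# passing the base URL makes relative links come back as absolute URIs
my $extor = HTML::LinkExtor->new(undef, $url);
$extor->parse($html);

# each entry is [ $tag, attribute => url, ... ]; keep only <a href="...">
for my $link ($extor->links) {
    my ($tag, %attrs) = @$link;
    print "$attrs{href}\n" if $tag eq 'a' && $attrs{href};
}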

Greg Bacon
+5  A: 

The classic way to crawl sites in Perl is with WWW::Mechanize, which has a links() method that returns a list of all the links on a page. You can grab a page, get its links, and then use the follow_link() or get() method to fetch the linked page.
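
A rough sketch of that approach (the URL and the link-text pattern below are just placeholders):

use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://www.example.com/');

# links() returns WWW::Mechanize::Link objects
for my $link ( $mech->links ) {
    print $link->url_abs, "\n";
}

# follow a link by matching its text, then collect the links on that page too
$mech->follow_link( text_regex => qr/about/i );
my @more_links = $mech->links;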

Drew Stephens
Thanks, I already know that module, but it's too much of an overhead to use it just for this one function, I guess.
soulSurfer2010
+5  A: 

HTML::SimpleLinkExtor is a bit simpler than HTML::LinkExtor. You might also check out my half-hearted attempt at a webreaper tool, which has some of the code you'll probably need.
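
Roughly, assuming you fetch the page yourself with LWP::Simple (the URL is a placeholder):

use HTML::SimpleLinkExtor;
use LWP::Simple;

my $html  = get('http://www.example.com/') or die "Couldn't fetch the page";
my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);

my @all_links    = $extor->links;   # every link-like attribute it found
my @anchor_links = $extor->a;       # just the hrefs from <a> tags

print "$_\n" for @anchor_links;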

brian d foy
+1  A: 

Another way to do this is to use HTML::TreeBuilder to parse the HTML from the page. It returns a tree of objects that you can use to grab all of the links from a page, and it can do much more, such as finding a link based on a regexp pattern you specify. Check out HTML::Element's documentation to see more.

To find all of the links in a page:

use HTML::TreeBuilder;
use LWP::Simple;

my $url  = 'http://www.example.com/';
my $html = HTML::TreeBuilder->new_from_content( get($url) );

# every <a> element in the tree, as HTML::Element objects
my @links = $html->look_down( _tag => 'a' );

# the href attribute of each element is the actual URL
my @urls = grep { defined } map { $_->attr('href') } @links;
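
look_down will also take a compiled regexp as an attribute value, which is the pattern-based matching mentioned above; continuing from the snippet, something like this (the .pdf pattern is only an illustration):

# only the <a> elements whose href matches the pattern
my @pdf_links = $html->look_down(
    _tag => 'a',
    href => qr/\.pdf$/i,
);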

I believe LWP::Simple and HTML::TreeBuilder are both included in Ubuntu as well.

James Kastrantas