Is there a module out there that can give me links to all the pages a website has?
Why I need it: I want to crawl some sites and search for tags in them; searching only the main page is not enough.
Thanks,
The classic way to crawl sites in Perl is with WWW::Mechanize, which has a links method that returns a list of all the links on the page. You can grab a page, get the links from it, and then use the follow_link() or get() method to fetch the linked page.
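A rough, untested sketch of that pattern (swap in your own URL and add whatever error handling you need):

use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://www.example.com/');

# links() returns WWW::Mechanize::Link objects for every link on the page
for my $link ( $mech->links ) {
    printf "%s -> %s\n", $link->text || '', $link->url_abs;
}

# Follow a particular link, e.g. by matching its text
# $mech->follow_link( text_regex => qr/next/i );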
HTML::SimpleLinkExtor is a bit simpler than HTML::LinkExtor. You might also check out my half-hearted attempt at my webreaper tool, which has some of the code you'll probably need.
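A minimal sketch of HTML::SimpleLinkExtor, assuming you fetch the page yourself with LWP::Simple and just want the link attributes back:

use HTML::SimpleLinkExtor;
use LWP::Simple;

my $extor = HTML::SimpleLinkExtor->new();
$extor->parse( get('http://www.example.com/') );

# links() returns every link-carrying attribute value; a() restricts to <a href="...">
my @all_links    = $extor->links;
my @anchor_links = $extor->a;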
Another way to do this is to use HTML::TreeBuilder to parse the HTML from the page. It returns a tree of objects that you can use to grab all of the links from a page, and it can do much more, such as finding a link based on a regexp pattern you specify. Check out HTML::Element's documentation to see more.
To find all of the links in a page:
use HTML::TreeBuilder;
use LWP::Simple;

# Fetch the page and parse it into a tree of HTML::Element objects
my $url  = 'http://www.example.com/';
my $html = HTML::TreeBuilder->new_from_content(get($url));

# look_down() finds every element matching the criteria; here, all <a> tags
my @links = $html->look_down('_tag' => 'a');
I believe LWP::Simple and HTML::TreeBuilder are both available as packages in Ubuntu as well.