I want to check a site for links, and then recursively check the linked pages for their links as well. But I don't want to fetch the same page twice. I'm having trouble with the logic. This is my Perl code:
use strict;
use warnings;

use LWP::UserAgent;
use HTML::TreeBuilder;
use HTTP::Request;
use URI::URL;

my $ua = LWP::UserAgent->new;
my $starting_url = shift @ARGV;   # start page (passed on the command line)

my %urls_to_check = ();   # URLs discovered but not yet fetched
my %checked_urls  = ();   # URLs already fetched

&fetch_and_parse($starting_url);

# dump both hashes so I can see what actually got crawled
use Data::Dumper; die Dumper(\%checked_urls, \%urls_to_check);

sub fetch_and_parse {
    my ($url) = @_;

    # skip URLs we've already fetched (this is the check that doesn't seem to work)
    if ($checked_urls{$url} > 1) { return 0; }

    warn "Fetching 'me' links from $url";

    # fetch the page, feeding each chunk to the HTML parser
    my $p   = HTML::TreeBuilder->new;
    my $req = HTTP::Request->new(GET => $url);
    my $res = $ua->request($req, sub { $p->parse($_[0]) });
    $p->eof();

    my $base = $res->base;

    # collect every <a> tag and resolve its href against the page's base URL
    my @tags = $p->look_down("_tag", "a");
    foreach my $e (@tags) {
        my $full = url($e->attr('href'), $base)->abs;
        $urls_to_check{$full} = 1 if (!defined($checked_urls{$full}));
    }

    # recurse into every URL still waiting to be checked
    foreach my $url (keys %urls_to_check) {
        delete $urls_to_check{$url};
        $checked_urls{$url}++;
        &fetch_and_parse($url);
    }
}
But this doesn't seem to actually do what I want.
Help?!
EDIT: I want to fetch the URLs from $starting_url, and then fetch any and all URLs found in the resulting pages. But if one of those URLs links back to $starting_url, I don't want to fetch it again.
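To make the intent concrete, this is roughly the behaviour I'm after, sketched as an iterative worklist loop instead of recursion (not my actual code: it uses the same modules as above, takes $starting_url from @ARGV, and uses $ua->get rather than the callback-style request; the key point is that a URL is marked as checked before it is fetched, so anything linking back to $starting_url just gets skipped):

use strict;
use warnings;

use LWP::UserAgent;
use HTML::TreeBuilder;
use URI::URL;

my $ua = LWP::UserAgent->new;

my $starting_url  = shift @ARGV;       # assumed: start page passed on the command line
my %checked_urls  = ();                # every URL already fetched
my @urls_to_check = ($starting_url);   # worklist of URLs still to fetch

while (my $url = shift @urls_to_check) {
    next if $checked_urls{$url};       # already fetched (including $starting_url itself)
    $checked_urls{$url} = 1;           # mark *before* fetching so it can never be re-queued

    warn "Fetching links from $url";
    my $res = $ua->get($url);
    next unless $res->is_success;

    my $p    = HTML::TreeBuilder->new_from_content($res->decoded_content);
    my $base = $res->base;

    foreach my $e ($p->look_down(_tag => 'a')) {
        next unless defined $e->attr('href');
        my $full = url($e->attr('href'), $base)->abs->as_string;
        push @urls_to_check, $full unless $checked_urls{$full};
    }
    $p->delete;   # free the parse tree before the next page
}

Is that the right way to structure the dedup check, or can my recursive version above be fixed to do the same thing?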