I want to check a site for links, and then recursively check those sites for links. But I don't want to fetch the same page twice. I'm having trouble with the logic. This is Perl code:

use LWP::UserAgent;
use HTML::TreeBuilder;
use URI::URL;                  # provides the url() function used below

my $ua = LWP::UserAgent->new;

my %urls_to_check = ();
my %checked_urls  = ();

&fetch_and_parse($starting_url);

use Data::Dumper; die Dumper(\%checked_urls, \%urls_to_check);

sub fetch_and_parse {
    my ($url) = @_;

    if ($checked_urls{$url} > 1) { return 0; }
    warn "Fetching 'me' links from $url";

    my $p = HTML::TreeBuilder->new;

    my $req = HTTP::Request->new(GET => $url);
    my $res = $ua->request($req, sub { $p->parse($_[0])});
    $p->eof();

    my $base = $res->base;

    my @tags = $p->look_down(
        "_tag", "a",
    );

    foreach my $e (@tags) {
        my $full = url($e->attr('href'), $base)->abs;
        $urls_to_check{$full} = 1 if (!defined($checked_urls{$full}));
    }

    foreach my $url (keys %urls_to_check) {
        delete $urls_to_check{$url};
        $checked_urls{$url}++;
        &fetch_and_parse($url);
    }
}

But this doesn't seem to actually do what I'm wanting.

Help?!

EDIT: I want to fetch the URLs from $starting_url, and then fetch any and all URLs found in the resulting pages. But if one of those URLs links back to $starting_url, I don't want to fetch it again.

+8  A: 

Simplest thing to do would be to not reinvent the wheel and use the CPAN.
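For example (my choice of module, not necessarily the one this answer had in mind), WWW::Mechanize from CPAN handles the fetching, link extraction, and URL resolution; a minimal sketch, with a hypothetical starting URL:

use strict;
use warnings;
use WWW::Mechanize;

my $mech  = WWW::Mechanize->new( autocheck => 0 );
my @queue = ('http://example.com/');   # hypothetical starting URL
my %seen;

while ( my $url = shift @queue ) {
    next if $seen{$url}++;             # never fetch the same page twice
    $mech->get($url);
    next unless $mech->success and $mech->is_html;

    # find_all_links returns WWW::Mechanize::Link objects;
    # url_abs() resolves each href against the page's base URL
    push @queue, map { $_->url_abs->as_string }
                 $mech->find_all_links( tag => 'a' );
}

In practice you would probably also restrict the queue to URLs on the starting host, or the crawl will wander off the site.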

David Dorward
A: 

If you want to extract all the links from a page, I recommend HTML::LinkExtor by Gisle Aas; a quick CPAN search will turn it up. You can then traverse the found links by pushing them onto a list and popping them off, checking each one against a hash of already-visited URLs (as you are doing) before fetching it.
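A minimal sketch of that approach (the starting URL and user-agent setup are my own assumptions, not part of the original answer):

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

my $ua    = LWP::UserAgent->new;
my @queue = ('http://example.com/');   # hypothetical starting URL
my %seen;

while ( my $url = shift @queue ) {
    next if $seen{$url}++;             # skip anything already visited
    my $res = $ua->get($url);
    next unless $res->is_success;

    # Passing a base URL makes HTML::LinkExtor return absolute URI objects
    my $parser = HTML::LinkExtor->new( undef, $res->base );
    $parser->parse( $res->decoded_content );
    $parser->eof;

    for my $link ( $parser->links ) {
        my ( $tag, %attrs ) = @$link;
        next unless $tag eq 'a' and $attrs{href};
        push @queue, $attrs{href}->as_string;
    }
}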

rpkelly
+2  A: 

I would guess that the problem is that

foreach my $url (keys %urls_to_check) {...}

is not recursing in the way that you think it is. As written, every URL you recover triggers another recursive call to your function, so the call stack keeps growing, which is very memory-inefficient.

Although you are writing a program to "recursively" crawl web pages, in your code you need to use iteration, not recursion:

sub fetch_and_parse {
    my ($url) = @_;
    $urls_to_check{$url} = 1;
    while (%urls_to_check) {
        # Grab a URL and process it, putting any new URLs
        # you find into %urls_to_check
    }
}
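For concreteness, here is one way that loop body could be filled in, reusing the asker's HTML::TreeBuilder and URI::URL approach (a sketch under those assumptions, not this answer's own implementation):

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use URI::URL;

my $ua = LWP::UserAgent->new;
my %checked_urls;

sub fetch_and_parse {
    my ($start) = @_;
    my %urls_to_check = ( $start => 1 );

    while (%urls_to_check) {
        # Take one URL off the to-do list
        my ($url) = keys %urls_to_check;
        delete $urls_to_check{$url};
        next if $checked_urls{$url}++;     # already fetched this one

        my $res = $ua->get($url);
        next unless $res->is_success;

        # Extract <a href> links and queue any we have not seen yet
        my $p = HTML::TreeBuilder->new_from_content( $res->decoded_content );
        for my $e ( $p->look_down( _tag => 'a' ) ) {
            next unless defined $e->attr('href');
            my $full = url( $e->attr('href'), $res->base )->abs->as_string;
            $urls_to_check{$full} = 1 unless $checked_urls{$full};
        }
        $p->delete;    # HTML::TreeBuilder trees must be freed explicitly
    }
}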

Of course, as other posters have noted, there are other tools that can automate this for you.

Anon Guy
A: 

Hi, maybe this can help you: blog.0x53a.de/where-do-my-links-go/ It does a breadth-first search starting from a given website. The module it uses, HTML::LinkExtractor, may also be interesting for you.

Regards, Manuel

+1  A: 

If you have a queue of links to check and you want to skip duplicates, use a hash to note the ones that you've already visited. Skip the links that are in that hash:

my @need_to_check   = ( ... ); # however you make that list
my %already_checked = ();

while( my $link = shift @need_to_check )
    {
    next if exists $already_checked{$link};
    ...;
    $already_checked{$link}++;
    }

The situation is a bit more complicated with URLs that look slightly different but end up at the same resource, like http://example.com, http://www.example.com, http://www.example.com/, and so on. If I cared about those, I'd add a normalization step, creating a URI object for each link and then pulling the URL back out as a string. If it were a bigger problem, I'd also look at the URL that the response headers claimed I actually got (say, after redirection and so on) and mark that I'd seen those as well.
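A minimal sketch of that normalization step, using the URI module's canonical() method (the helper name is my own, and this only covers the easy part: lowercased host, default port dropped, escapes normalized):

use URI;

# Reduce a link to URI's canonical string form before using it as a hash key.
sub normalize_url {
    my ($link) = @_;
    return URI->new($link)->canonical->as_string;
}

# e.g.  next if exists $already_checked{ normalize_url($link) };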

brian d foy