I want to check a site for links, and then recursively check those sites for links. But I don't want to fetch the same page twice. I'm having trouble with the logic. This is Perl code:

use LWP::UserAgent;
use HTML::TreeBuilder;
use URI::URL;                  # provides the url() function used below

my $ua = LWP::UserAgent->new;

my %urls_to_check = ();
my %checked_urls  = ();

&fetch_and_parse($starting_url);

use Data::Dumper; die Dumper(\%checked_urls, \%urls_to_check);

sub fetch_and_parse {
    my ($url) = @_;

    if ($checked_urls{$url} > 1) { return 0; }
    warn "Fetching 'me' links from $url";

    my $p = HTML::TreeBuilder->new;

    my $req = HTTP::Request->new(GET => $url);
    my $res = $ua->request($req, sub { $p->parse($_[0])});
    $p->eof();

    my $base = $res->base;

    my @tags = $p->look_down(
        "_tag", "a",
    );

    foreach my $e (@tags) {
        my $full = url($e->attr('href'), $base)->abs;
        $urls_to_check{$full} = 1 if (!defined($checked_urls{$full}));
    }

    foreach my $url (keys %urls_to_check) {
        delete $urls_to_check{$url};
        $checked_urls{$url}++;
        &fetch_and_parse($url);
    }
}

But this doesn't seem to actually do what I'm wanting.

Help?!

EDIT: I want to fetch the URLs from $starting_url, and then fetch any and all URLs found in the resulting pages. But if one of those URLs links back to $starting_url, I don't want to fetch it again.

+8  A: 

Simplest thing to do would be to not reinvent the wheel and use the CPAN.
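For example (my choice of module, not necessarily the one this answer had in mind), WWW::Mechanize from CPAN handles the fetching, link extraction, and URL resolution; a minimal sketch, with a hypothetical starting URL:

use strict;
use warnings;
use WWW::Mechanize;

my $mech  = WWW::Mechanize->new( autocheck => 0 );
my @queue = ('http://example.com/');   # hypothetical starting URL
my %seen;

while ( my $url = shift @queue ) {
    next if $seen{$url}++;             # never fetch the same page twice
    $mech->get($url);
    next unless $mech->success and $mech->is_html;

    # find_all_links returns WWW::Mechanize::Link objects;
    # url_abs() resolves each href against the page's base URL
    push @queue, map { $_->url_abs->as_string }
                 $mech->find_all_links( tag => 'a' );
}

In practice you would probably also restrict the queue to URLs on the starting host, or the crawl will wander off the site.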

David Dorward
A: 

If you want to extract all the links from a page, I recommend HTML::LinkExtor by Gisle Aas; a quick CPAN search will turn it up. You can then traverse the found links by pushing them onto a list and popping them off, checking each one against a hash of already-visited URLs (as you are doing) before fetching it.
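A minimal sketch of that approach (the starting URL and user-agent setup are my own assumptions, not part of the original answer):

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;

my $ua    = LWP::UserAgent->new;
my @queue = ('http://example.com/');   # hypothetical starting URL
my %seen;

while ( my $url = shift @queue ) {
    next if $seen{$url}++;             # skip anything already visited
    my $res = $ua->get($url);
    next unless $res->is_success;

    # Passing a base URL makes HTML::LinkExtor return absolute URI objects
    my $parser = HTML::LinkExtor->new( undef, $res->base );
    $parser->parse( $res->decoded_content );
    $parser->eof;

    for my $link ( $parser->links ) {
        my ( $tag, %attrs ) = @$link;
        next unless $tag eq 'a' and $attrs{href};
        push @queue, $attrs{href}->as_string;
    }
}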

rpkelly
+2  A: 

I would guess that the problem is that

foreach my $url (keys %urls_to_check) {...}

is not recursing in the way that you think it is. As written, every URL you recover triggers another recursive call to your function, so the call stack keeps growing, which is very memory-inefficient.

Although you are writing a program to "recursively" crawl web pages, in your code you need to use iteration, not recursion:

sub fetch_and_parse {
    my ($url) = @_;
    $urls_to_check{$url} = 1;
    while (%urls_to_check) {
        # Grab a URL and process it, putting any new URLs
        # you find into %urls_to_check
    }
}
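For concreteness, here is one way that loop body could be filled in, reusing the asker's HTML::TreeBuilder and URI::URL approach (a sketch under those assumptions, not this answer's own implementation):

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
use URI::URL;

my $ua = LWP::UserAgent->new;
my %checked_urls;

sub fetch_and_parse {
    my ($start) = @_;
    my %urls_to_check = ( $start => 1 );

    while (%urls_to_check) {
        # Take one URL off the to-do list
        my ($url) = keys %urls_to_check;
        delete $urls_to_check{$url};
        next if $checked_urls{$url}++;     # already fetched this one

        my $res = $ua->get($url);
        next unless $res->is_success;

        # Extract <a href> links and queue any we have not seen yet
        my $p = HTML::TreeBuilder->new_from_content( $res->decoded_content );
        for my $e ( $p->look_down( _tag => 'a' ) ) {
            next unless defined $e->attr('href');
            my $full = url( $e->attr('href'), $res->base )->abs->as_string;
            $urls_to_check{$full} = 1 unless $checked_urls{$full};
        }
        $p->delete;    # HTML::TreeBuilder trees must be freed explicitly
    }
}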

Of course, as other posters have noted, there are other tools that can automate this for you.

Anon Guy
A: 

Hi, maybe this can help you: blog.0x53a.de/where-do-my-links-go/ It does a breadth-first search starting from a given website. The module it uses, HTML::LinkExtractor, may also be interesting for you.

Regards, Manuel

+1  A: 

If you have a queue of links to check and you want to skip duplicates, use a hash to note the ones that you've already visited. Skip the links that are in that hash:

my @need_to_check   = ( ... ); # however you make that list
my %already_checked = ();

while( my $link = shift @need_to_check )
    {
    next if exists $already_checked{$link};
    ...;
    $already_checked{$link}++;
    }

The situation is a bit more complicated with URLs that look slightly different but end up at the same resource, like http://example.com, http://www.example.com, http://www.example.com/, and so on. If I cared about those, I'd add a normalization step, creating a URI object for each link and then pulling the URL back out as a string. If it were a bigger problem, I'd also look at the URL that the response headers claimed I actually got (say, after redirection and so on) and mark that I'd seen those as well.
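A minimal sketch of that normalization step, using the URI module's canonical() method (the helper name is my own, and this only covers the easy part: lowercased host, default port dropped, escapes normalized):

use URI;

# Reduce a link to URI's canonical string form before using it as a hash key.
sub normalize_url {
    my ($link) = @_;
    return URI->new($link)->canonical->as_string;
}

# e.g.  next if exists $already_checked{ normalize_url($link) };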

brian d foy