views: 66
answers: 2

Hi,

I need to crawl a website and retrieve certain data that is updated every few minutes. How do I do this?

+4  A: 

Load WWW::Mechanize for crawling, and use the mirror method inherited from LWP::UserAgent.
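A minimal sketch of that approach (the URL, output filename, and 5-minute interval are placeholders): `mirror` issues a conditional GET, so the server only sends the body again when the page has actually changed.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
my $url  = 'http://www.nytimes.com';   # sample page to poll

while (1) {
    # mirror() saves the response to a file and sends If-Modified-Since
    # on later requests; a 304 means the copy on disk is still current.
    my $response = $mech->mirror($url, 'page.html');
    if ($response->code == 200) {
        print "Page changed, local copy updated\n";
    }
    sleep 300;   # poll every 5 minutes
}
```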

daxim
A: 

Use sleep to control the wait period, and use WWW::Mechanize for data retrieval:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
my $url = "http://www.nytimes.com";  # a sample webpage
while (1) {
    $mech->get($url);
    # format => 'text' requires HTML::TreeBuilder; see the
    # WWW::Mechanize docs for more advanced content processing
    print $mech->content(format => 'text');
    sleep 300;  # wait five minutes between requests
}

EDIT: improved the sample content retrieval process.

Zhang18
Be a good Web citizen and make that [GET request conditional](http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.3). If the page has not changed, it does not need to be downloaded again.
daxim
Downvote: direct access to the innards of the object is not any good. Use the [`decoded_content` method inherited from `HTTP::Message`](http://p3rl.org/HTTP::Message#%24mess-%3Edecoded_content%28_%25options_%29) instead.
daxim