views: 66
answers: 2

Hi,

I need to crawl a website and retrieve certain data that is updated every few minutes. How do I do this?

+4  A: 

Load WWW::Mechanize for crawling, and use the mirror method inherited from LWP::UserAgent.
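A minimal sketch of that approach (the URL, output filename, and 5-minute interval are placeholders): `mirror` issues a conditional GET, so the server only sends the body again when the page has actually changed.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
my $url  = 'http://www.nytimes.com';   # sample page to poll

while (1) {
    # mirror() saves the response to a file and sends If-Modified-Since
    # on later requests; a 304 means the copy on disk is still current.
    my $response = $mech->mirror($url, 'page.html');
    if ($response->code == 200) {
        print "Page changed, local copy updated\n";
    }
    sleep 300;   # poll every 5 minutes
}
```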

daxim
A: 

Use sleep to control the wait period, and use WWW::Mechanize for data retrieval:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
my $url = "http://www.nytimes.com";  # a sample webpage
while (1) {
    $mech->get($url);
    # format => 'text' requires HTML::TreeBuilder; see the
    # WWW::Mechanize docs for more advanced content processing
    print $mech->content(format => 'text');
    sleep 300;  # wait five minutes between requests
}

EDIT: improved the sample content retrieval process.

Zhang18
Be a good Web citizen and make that [GET request conditional](http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.3). If the page has not changed, it does not need to be downloaded again.
daxim
Downvote: direct access to the innards of the object is not any good. Use the [`decoded_content` method inherited from `HTTP::Message`](http://p3rl.org/HTTP::Message#%24mess-%3Edecoded_content%28_%25options_%29) instead.
daxim