I'm doing some web scraping using Perl's LWP. I need to process a set of URLs, some of which may redirect (one or more times).

How can I get the ultimate URL, with all redirects resolved, using the HEAD method?

+5  A: 

As stated in perldoc LWP::UserAgent, the default is to follow redirects for GET and HEAD requests:

$ua = LWP::UserAgent->new( %options )

...
       KEY                     DEFAULT
       -----------             --------------------
       max_redirect            7
       ...
       requests_redirectable   ['GET', 'HEAD']

Here is an example:

#!/usr/bin/perl

use strict; use warnings;
use LWP::UserAgent;

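my $ua = LWP::UserAgent->new();
$ua->show_progress(1);  # print each request and its status as it happens

my $response = $ua->head('http://unur.com/');

if ( $response->is_success ) {
    # The request attached to the final response carries the post-redirect URI
    print $response->request->uri->as_string, "\n";
}
else {
    print "Error: ", $response->status_line, "\n";
}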

Output:

** HEAD http://unur.com/ ==> 301 Moved Permanently (1s)
** HEAD http://www.unur.com/ ==> 200 OK
http://www.unur.com/
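
The defaults listed above can also be passed explicitly to the constructor and read back through the accessors of the same name. A minimal sketch, assuming you want a different redirect limit (the value 10 is an arbitrary choice for illustration, not a recommendation). It makes no network requests:

```perl
#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

# Assumption: 10 is an arbitrary illustrative limit, not a recommended value
my $ua = LWP::UserAgent->new(
    max_redirect          => 10,
    requests_redirectable => ['GET', 'HEAD'],
);

# Read the settings back to confirm what the agent will do
print "max_redirect: ", $ua->max_redirect, "\n";
print "redirectable: ", join(', ', @{ $ua->requests_redirectable }), "\n";
```

Setting max_redirect to 0 should stop the agent from following redirects at all, in which case the response you get back is the redirect response itself.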
Sinan Ünür
Absolutely correct, but I think the OP wanted to know what the URLs actually were once all redirects had been followed.
Tony Miller
@Tony Thank you for the heads up. I did not notice it immediately and posted a sample script apparently after your answer was accepted.
Sinan Ünür
Oooh, I didn't see the uri->as_string method which shows the entire sequence. Very nice.
Tony Miller
@Tony, The sequence comes from `$ua->show_progress(1);`
Sinan Ünür
+3  A: 

If you use the full-featured LWP::UserAgent interface, the response you get back is an instance of HTTP::Response, which in turn holds an HTTP::Request as an attribute. Note that this is NOT necessarily the same HTTP::Request that you created with the original URL from your set. The HTTP::Response documentation says as much in its description of the method that retrieves the request from the response:

$r->request( $request )

This is used to get/set the request attribute. The request attribute is a reference to the request that caused this response. It does not have to be the same request passed to the $ua->request() method, because there might have been redirects and authorization retries in between.

Once you have the request object, you can use the uri method to get the URI. If redirects were followed, that URI is the end of the redirect chain.
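
To see the intermediate hops as well as the final URI, HTTP::Response also provides previous and redirects, which walk back through the responses that led to the final one. A rough sketch, not a definitive implementation: final_uri and uri_chain are hypothetical helper names, and the responses are built by hand (with example.com URLs as stand-ins) so that it runs without network access. In real use you would pass in whatever $ua->head(...) or $ua->request(...) returned:

```perl
#!/usr/bin/perl

use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;

# Hypothetical helper: URI of the request that produced the final response
sub final_uri {
    my ($res) = @_;
    return $res->request->uri->as_string;
}

# Hypothetical helper: every URI visited, oldest first, ending with the final one
sub uri_chain {
    my ($res) = @_;
    return map { $_->request->uri->as_string } $res->redirects, $res;
}

# Hand-built stand-ins for a 301 followed by a 200 (assumed URLs)
my $first = HTTP::Response->new(301, 'Moved Permanently');
$first->request(HTTP::Request->new(HEAD => 'http://example.com/'));

my $last = HTTP::Response->new(200, 'OK');
$last->request(HTTP::Request->new(HEAD => 'http://www.example.com/'));
$last->previous($first);    # link the final response back to the redirect

print "Final URI: ", final_uri($last), "\n";
print "Chain: ", join(' -> ', uri_chain($last)), "\n";
```

With the hand-built responses above, this prints the final URI http://www.example.com/ and a two-element chain starting at http://example.com/.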

Here's a Perl script, tested and verified, that gives you the skeleton of what you need:

#!/usr/bin/perl

use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Request;

my $ua;  # Instance of LWP::UserAgent
my $req; # Instance of (original) request
my $res; # Instance of HTTP::Response returned via request method

$ua = LWP::UserAgent->new;
$ua->agent("$0/0.1 " . $ua->agent);

$req = HTTP::Request->new(HEAD => 'http://www.ecu.edu/wllc');
$req->header('Accept' => 'text/html');

$res = $ua->request($req);

if ($res->is_success) {
    # Chained method calls; you may want to check that $res->request
    # is defined before chaining.
    # This is the inline version of
    #   my $finalrequest = $res->request();
    #   print "Final URI = " . $finalrequest->uri() . "\n";
    print "Final URI = " . $res->request()->uri() . "\n";
} else {
    print "Error: " . $res->status_line . "\n";
}
Tony Miller
Thanks for the thorough explanation.
planetp