I'm new to Perl and HTML. I'm trying to use $mech->get($url) to fetch the periodic table page at http://en.wikipedia.org/wiki/Periodic_table, but it keeps returning an error message like this:

Error GETing http://en.wikipedia.org/wiki/Periodic_table: Forbidden at PeriodicTable.pl line 13

But $mech->get($url) works fine if $url is http://search.cpan.org/.

Any help will be much appreciated!


Here is my code:

#!/usr/bin/perl -w

use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new( autocheck => 1 );

$mech = WWW::Mechanize->new();

my $table_url = "http://en.wikipedia.org/wiki/Periodic_table/";

$mech->get( $table_url );
+7  A: 

That's because Wikipedia denies access to some programs based on the User-Agent header sent with the request.

You can make your script appear as a 'normal' web browser by setting the agent string after instantiation and before the get(), for example:

$mech->agent( 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8' );

That worked for me with the URL in your posting. Shorter strings will probably work too.

(I think you should also remove the trailing slash from the URL.)
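
Putting those two changes together, a minimal sketch of the script might look like this. It assumes WWW::Mechanize is installed and the network is reachable; the eval wrapper and the printed messages are my own additions for illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 1 makes get() die on HTTP errors such as 403 Forbidden.
my $mech = WWW::Mechanize->new( autocheck => 1 );

# Replace Mech's default agent string with the browser-like one shown above.
$mech->agent( 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.17.8 (KHTML, like Gecko) Version/5.0.1 Safari/533.17.8' );

# Same page as in the question, without the trailing slash.
my $table_url = 'http://en.wikipedia.org/wiki/Periodic_table';

# Trap the exception so a failed request is reported instead of aborting.
my $ok = eval { $mech->get( $table_url ); 1 };
if ( $ok ) {
    my $title = $mech->title;
    print "Fetched: ", ( defined $title ? $title : '(no title)' ), "\n";
}
else {
    warn "Request failed: $@";
}
```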

WWW::Mechanize is a subclass of LWP::UserAgent; see the docs there for more info, including on the agent() method.

You should limit your use of this method of access though. Wikipedia explicitly denies access to some spiders in its robots.txt file. The default user agent for LWP::UserAgent (which starts with libwww) is in the list.
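
To see how a rule like that plays out, here is a rough sketch using WWW::RobotRules (assumed to be installed; it normally comes along with libwww-perl). The robots.txt text, the example.org URLs and the 'MyCrawler' name are all made up for illustration and are not copied from Wikipedia's actual file.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::RobotRules;

# Made-up robots.txt excerpt: one record that shuts out agents matching "libwww".
my $robots_txt = <<'END';
User-agent: libwww
Disallow: /
END

# Hypothetical URLs, used only so the rules have a host to attach to.
my $robots_url = 'http://example.org/robots.txt';
my $page_url   = 'http://example.org/wiki/Periodic_table';

my %verdict;
for my $agent ( 'libwww-perl/6.05', 'MyCrawler/1.0' ) {
    my $rules = WWW::RobotRules->new( $agent );
    $rules->parse( $robots_url, $robots_txt );

    # allowed() returns a true value when no rule blocks this URL for the agent.
    $verdict{$agent} = $rules->allowed( $page_url ) ? 'allowed' : 'disallowed';
    printf "%-18s %s\n", $agent, $verdict{$agent};
}
```

Note that this only tells you what robots.txt asks for; the Forbidden response in the question comes from the server looking at the User-Agent header.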

martin clayton
Thanks Martin! It works!
Z.Zen
You should also look at the [agent_alias](http://search.cpan.org/perldoc?WWW::Mechanize#$mech-%243Eagent_alias%28_%24alias_%29) method, which lets you easily impersonate common browsers without having to remember that big version string.
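
A small sketch of what that could look like, assuming the 'Linux Mozilla' alias is present in your installed version of WWW::Mechanize (the list printed first shows what is actually available):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# Show every alias the installed WWW::Mechanize knows about.
print "$_\n" for $mech->known_agent_aliases;

# Pick one alias; Mech fills in the full browser version string for you.
$mech->agent_alias( 'Linux Mozilla' );
print "Agent is now: ", $mech->agent, "\n";
```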
cjm
A: 

When you have these sorts of problems, you need to watch the HTTP transactions so you can see what the webserver is sending back to you. In this case, you'd see that Mech connects and gets a response, but that response is Wikipedia refusing to serve the page to your bot. I like HTTP Scoop on the Mac.
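
If you don't have a separate sniffer handy, one possible way to see the traffic from inside the script is the handler mechanism Mech inherits from LWP::UserAgent. A rough sketch, assuming a libwww-perl recent enough to provide add_handler() and the dump() method on messages:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 0 so an error status comes back as a response object
# instead of making get() die.
my $mech = WWW::Mechanize->new( autocheck => 0 );

# Dump each request just before it is sent and each response once it
# has been received; dump() prints when called in void context.
$mech->add_handler( request_send  => sub { shift->dump; return } );
$mech->add_handler( response_done => sub { shift->dump; return } );

my $response = $mech->get( 'http://en.wikipedia.org/wiki/Periodic_table' );
print "Status: ", $response->status_line, "\n";
```

The request dump shows the User-Agent header that was actually sent, and the response dump shows the status line and headers the server returned.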

brian d foy