ansaurus

Question

fetch pages [LWP] parse them [HTML::TokeParser] and store results [DBI]

Answer 1

A:

It is hard to be too specific as your question is very general. I've retrieved pages using LWP, used TokeParser to extract data, and stored the output in a database many times. I haven't used WWW::Mechanize (Mech), but by all accounts it is simpler than using LWP directly.

Creating a user agent using LWP can be as simple as:

my $ua = LWP::UserAgent->new();

You will need to consider things like redirects, proxies, cookies and passwords, depending on your requirements.

To follow redirects after POST requests as well (LWP already follows them for GET and HEAD by default):

$ua = LWP::UserAgent->new(
    requests_redirectable =>   ['GET', 'HEAD', 'POST' ]
);

To store cookies:

$ua->cookie_jar( {} );

To set up a proxy:

$ua->proxy("http", "http://localhost:8888");  # Fiddler

To add a password for authentication:

$ua->credentials( 'www.myhostingplace.com:443' , 'Realm' , 'userid', 'password');
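If you need several of these settings at once, they can be collected in one place. This is only a sketch: the timeout value and the agent string 'MyScraper/0.1' are made-up examples, and the proxy and credentials are the same placeholders used above.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# One user agent carrying the redirect, cookie, proxy and password settings.
my $ua = LWP::UserAgent->new(
    requests_redirectable => [ 'GET', 'HEAD', 'POST' ],  # also follow redirects after POST
    cookie_jar            => {},                         # keep cookies in memory
    timeout               => 30,                         # seconds; illustrative value
    agent                 => 'MyScraper/0.1',            # hypothetical User-Agent string
);

$ua->proxy( 'http', 'http://localhost:8888' );           # e.g. Fiddler
$ua->credentials( 'www.myhostingplace.com:443', 'Realm', 'userid', 'password' );

# Show which request methods will be redirected automatically.
print join( ', ', @{ $ua->requests_redirectable } ), "\n";
```

Nothing is fetched here; the object is just configured and ready for get() or post().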

To get content from a page for local processing:

my $url = 'http://www.someurl.com';
my $response  = $ua->get($url);
if ( $response->is_error() ) {
   # Do some error stuff
}
my $content = $response->content();
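For the "error stuff", one option is to wrap the check in a small helper. The sketch below assumes you want to stop on any non-2xx response; body_or_die is a hypothetical name, and the two hand-built HTTP::Response objects only stand in for what $ua->get($url) would return. It uses decoded_content() rather than content() so that the charset and any compression are handled before parsing.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: return the decoded body, or die with the status line.
sub body_or_die {
    my ($response) = @_;
    die 'Request failed: ' . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Stand-ins for real responses, so the sketch runs without network access.
my $ok  = HTTP::Response->new( 200, 'OK', [ 'Content-Type' => 'text/html' ], '<p>hi</p>' );
my $bad = HTTP::Response->new( 404, 'Not Found' );

print body_or_die($ok), "\n";

# Trap the failure instead of letting the whole script die.
my $body = eval { body_or_die($bad) };
print "error: $@" unless defined $body;
```

Checking is_success() instead of is_error() also catches a 3xx response that was not followed, which is neither a success nor an error.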

To parse the content using TokeParser:

my $stream = HTML::TokeParser->new(\$content);

while ( my $t = $stream->get_token() ) {
   if ( $t->[0] eq 'S' and $t->[1] eq 'input' ) {
      if ( uc( $t->[2]{ 'name' } // '' ) eq 'SEARCHVALUE' ) {   # // '' avoids a warning when there is no name attribute
           my $data = $t->[2]{ 'value' };
           # Do something with data
      }
   }
}

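The content is passed to TokeParser by reference; I then walk through the stream using get_token(). Each token comes back as an array reference which you can examine to determine what to do next: element 0 is the token type ('S' for a start tag) and, for a start tag, element 1 is the tag name and element 2 is a hash reference of its attributes.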

In the TokeParser loop above I want to find input tags whose name attribute is 'SEARCHVALUE' and then store their 'value' attribute. The HTML fragment might look something like this:

<input type="hidden" name="SEARCHVALUE" value="Spock" />

When I hit the start of an input tag ($t->[0] eq 'S' and $t->[1] eq 'input'), I examine the tag's "name" attribute ($t->[2]{ 'name' }) to see if it matches the value I am searching for; if it does, I store the tag's "value" attribute ($t->[2]{ 'value' }) in a variable. I can then do whatever I like with the value, including storing it in a database.
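To round off the DBI part of the question, here is a rough sketch of parsing the fragment and storing the result. It makes several assumptions that are not from your setup: an in-memory SQLite database (which needs DBD::SQLite installed), a made-up table called scraped, and a hard-coded $content in place of a fetched page. It also uses get_tag(), which skips straight to the next matching start tag, instead of checking every token.

```perl
use strict;
use warnings;
use DBI;
use HTML::TokeParser;

# Sample HTML standing in for the fetched page content.
my $content = '<form><input type="hidden" name="SEARCHVALUE" value="Spock" /></form>';

# Collect the value attribute of every matching input tag.
my @values;
my $stream = HTML::TokeParser->new( \$content );
while ( my $tag = $stream->get_tag('input') ) {
    my $attr = $tag->[1];    # for start tags, get_tag returns [ $tagname, \%attr, ... ]
    next unless uc( $attr->{name} // '' ) eq 'SEARCHVALUE';
    push @values, $attr->{value};
}

# Assumption: in-memory SQLite database and a hypothetical table layout.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, AutoCommit => 1 } );
$dbh->do('CREATE TABLE scraped (name TEXT, value TEXT)');

# Placeholders let DBI take care of quoting the scraped text.
my $sth = $dbh->prepare('INSERT INTO scraped (name, value) VALUES (?, ?)');
$sth->execute( 'SEARCHVALUE', $_ ) for @values;

# Read the row back to confirm it was stored.
my ($stored) = $dbh->selectrow_array(
    'SELECT value FROM scraped WHERE name = ?', undef, 'SEARCHVALUE' );
print "stored: $stored\n";
```

For a real database you would swap the DSN, user and password in DBI->connect for your own; the prepare/execute part stays the same.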

You can do a lot with TokeParser, and in some cases it is simpler than using regular expressions to carve up the page, but it can also be a little challenging to get your head around. If you are only trying to extract a simple pattern from the returned HTML content, a regular expression can be just as good.
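For comparison, a regular expression version of the same extraction might look like the sketch below. It assumes the name attribute comes before value and that both are double-quoted, as in the fragment above; if the page does not guarantee that, TokeParser is the safer choice.

```perl
use strict;
use warnings;

# Sample HTML standing in for the fetched page content.
my $content = '<input type="hidden" name="SEARCHVALUE" value="Spock" />';

# Capture the value attribute of the input tag named SEARCHVALUE.
my ($value) = $content =~ /<input\b[^>]*\bname="SEARCHVALUE"[^>]*\bvalue="([^"]*)"/i;

print "value: $value\n" if defined $value;
```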

If you have a lot of this to do then I recommend "Perl and LWP" by Sean Burke from O'Reilly. It has been endlessly helpful for me in my web scraping endeavours.

Hope this helps you get started at least.

Auctionitis 2010-10-22 00:03:41

Hello Auctionitis, many thanks for the reply. This helps me get started. And yes, I have already ordered the book "Perl and LWP" by Sean Burke. A great tip! Note: I have lots of this to do. I will come up with more questions, but first I will do some exercises. Until soon! Regards

thebutcher 2010-10-22 22:21:49

Hello Auctionitis, thanks again for the message and your great hints. By the way, I have made some more progress. Can you help me debug? The code is here: http://stackoverflow.com/questions/4007480/applying-a-loop-on-lwpuseragent-to-fetch-1000-pages-at-once. I would love to hear from you. Martin

thebutcher 2010-10-24 20:41:47
