Q:

I looked up articles about using LWP, but I am still lost. On the site in question there is a list of many schools: from the overview page you follow some of the links and get to the individual result pages.

I want to fetch the pages with LWP::UserAgent, and for the parsing I want to use either HTML::TreeBuilder::XPath or HTML::TokeParser.
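
For the parsing step I imagine something like this minimal HTML::TreeBuilder::XPath sketch (the XPath expressions are placeholders; I don't know the real page structure yet):

use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# $html is assumed to hold one fetched page, e.g. $response->decoded_content from LWP::UserAgent
my $html = '<html><head><title>placeholder</title></head><body></body></html>';

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# placeholder XPath queries; the real expressions depend on the page structure
my $title = $tree->findvalue('//title');
my @hrefs = map { $_->attr('href') } $tree->findnodes('//a[@href]');

print "title: $title\n";
$tree->delete;    # free the memory held by the tree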

At the moment I am mulling over the right GET request, and I have some issues with LWP::UserAgent. The subpages of the overview can be reached via direct links, and note that each of these pages has content; see, for example, the URLs of the result pages mentioned above.

As a novice here I cannot post the full URLs, but here you can see how their endings differ:

id=21&extern_eid=709
id=21&extern_eid=789
id=21&extern_eid=1297
id=21&extern_eid=761

There are many different URLs that differ only at the end. The question is: how do I run LWP::UserAgent? I want to fetch and parse all 1000 pages.

Question: does LWP do the job automatically, or do I have to set up LWP::UserAgent so that it looks up the different URLs itself?

A possible solution: perhaps we have to count up from zero (to 10000 or 100000) in the extern_eid parameter, i.e. vary the number at the end of a URL like

www-db.sn.schule.de/index.php?id=21&extern_eid=709

By the way, here is the relevant part of the LWP::UserAgent documentation:

REQUEST METHODS

The methods described in this section are used to dispatch requests via the user agent. The following request methods are provided:

$ua->get( $url )
$ua->get( $url, $field_name => $value, ... )

This method will dispatch a GET request on the given $url. Further arguments can be given to initialize the headers of the request. These are given as separate name/value pairs. The return value is a response object. See HTTP::Response for a description of the interface it provides. There will still be a response object returned when LWP can't connect to the server specified in the URL or when other failures in protocol handlers occur.
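
From that documentation I understand that a single fetch would look roughly like this (the URL is just the example pattern from above):

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 30);

# one URL built from the pattern shown above
my $url = 'http://www-db.sn.schule.de/index.php?id=21&extern_eid=709';

my $response = $ua->get($url);

if ($response->is_success) {
    my $html = $response->decoded_content;    # page content, ready for the parser
    print length($html), " characters fetched\n";
}
else {
    warn "GET $url failed: ", $response->status_line, "\n";
}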

The question is: how do I use LWP::UserAgent on the above-mentioned site the right way, i.e. effectively?

I look forward to any and all help!

A: 

If I understand your question correctly, you are trying to use LWP::UserAgent on the same URL with different query arguments, and you are wondering whether LWP::UserAgent provides a way for you to loop through those query arguments.

I don't think LWP::UserAgent has a method for that. However, you can construct the URLs in a loop and call LWP::UserAgent's get repeatedly:

# assuming $ua is an LWP::UserAgent object and $url is the base URL without the query string
for my $id (0 .. 100000) {
    my $response = $ua->get($url . "?id=21&extern_eid=" . $id);
    # rest of the code: check $response->is_success, then parse $response->decoded_content
}

Alternatively, you can add a request_prepare handler that computes and adds the query arguments before the request is sent out; a rough sketch follows.
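
Roughly like this; the counter-driven extern_eid values below are just an assumption to mirror the loop above:

use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $ua      = LWP::UserAgent->new;
my $next_id = 0;

# before each request is sent, append the query arguments to its URI
$ua->add_handler(
    request_prepare => sub {
        my ($request, $ua, $handler) = @_;
        my $uri = URI->new($request->uri);
        $uri->query_form(id => 21, extern_eid => $next_id++);
        $request->uri($uri);
    }
);

# a plain GET of the base URL now gets id/extern_eid added automatically
my $response = $ua->get('http://www-db.sn.schule.de/index.php');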

Alvin
Many thanks, Alvin! Or would it be better to do it like this: for my $i (0..10000) { $ua->get('http://www-db.sn.schule.de/index.php', id => 21, extern_eid => $i); # process reply }? In any case, using a loop like this is one way to do this kind of job; I guess the LWP API does not aim to replace core Perl functionality, and we can use ordinary Perl loops to query multiple URLs. If I loop like this, some IDs will not return a matching result. What do I do in the subsequent parsing step? I suppose I have to add a check that gets rid of all the results that do not match.
thebutcher
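
One note on that snippet: as the documentation quoted in the question says, extra name/value pairs passed to get() initialize request headers, not query parameters, so the query string has to be built into the URL itself. A sketch of the loop with a check for non-matching pages, assuming an unsuccessful or empty response marks a non-match (the exact condition has to be verified against the real pages):

use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $ua  = LWP::UserAgent->new;
my $uri = URI->new('http://www-db.sn.schule.de/index.php');

for my $i (0 .. 10000) {
    # build the query into the URL; name/value pairs given to get() would become headers
    $uri->query_form(id => 21, extern_eid => $i);

    my $response = $ua->get($uri);
    next unless $response->is_success;    # skip IDs that return an error

    my $html = $response->decoded_content;
    next unless length $html;             # assumed "no match" condition; adjust after inspecting real pages

    # ... hand $html to the parser here ...
}
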
A: 

You describe following links for the purpose of web scraping. The LWP subclass WWW::Mechanize does this more easily than your current attempt.
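
A minimal sketch of that approach; the overview URL and the url_regex pattern below are guesses based on the URL endings shown in the question:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(autocheck => 0);

# start on the overview page and collect the links to the result pages
$mech->get('http://www-db.sn.schule.de/index.php?id=21');
my @links = $mech->find_all_links(url_regex => qr/extern_eid=\d+/);

for my $link (@links) {
    $mech->get($link->url_abs);
    next unless $mech->success;

    my $html = $mech->content;    # hand this to HTML::TreeBuilder::XPath or HTML::TokeParser
    # ... parse $html here ...
}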

daxim