ansaurus

Question

my first perl script: using "get($url)" method in a loop?

Answer 1

A:

Hey B,

I have yet to use Perl, but at first glance I'm wondering if the exception is thrown as a result of a 404 error. I would imagine that the function would just return undef if the HTTP response was either a 404, 403, redirect, etc, but maybe that is not the case.

I might recommend using wget for this. Something like `wget $url` I think would work.

Anyway, as I said, I'm not a Perl programmer, but since the link you posted is in fact a 404, that's my guess.

Let me know if you find that is the issue.

regex 2009-01-21 21:45:34

It's not, I just decided to conceal the actual link. It works fine; like I said, I tested the get method and the saving separately. When there is no entry for a day on the site, it returns an XML document exactly like the value I check against with $nullXML. I will try wget and see if it works.

gnomed 2009-01-21 21:50:00

wget works, but it's a little more finicky since it saves the file right away, and I don't want any file written to disk until it passes the check. I could check afterwards, but at the cost of more disk I/O. Thanks though, it is definitely useful for other cases.

gnomed 2009-01-21 22:24:37

Answer 2

+3 A:

Since http://www.adatabase.com/1987/oct/20.xml is a 404 (and isn't something that can be generated from your program anyway, as there is no 'subheading' in the path), I'm assuming that isn't the real link you are using, which makes it hard for us to test. As a general rule, please use example.com instead of making up hostnames; that's why it is reserved.

You should really

use strict;
use warnings;

in your code - this will help highlight any scoping issues you may have (I'd be surprised if it was the case, but there is a chance that a part of the LWP code is messing around with your $urlBase or something). I think it should be enough to put 'my' in front of the initial variable declarations (and $newUrl, $content and $filename) to make your code strict.

If using strict and warnings doesn't get you any closer to a solution, you could warn out the link you are about to use each loop so when it sticks you can try it in a browser and see what happens, or alternatively using a packet sniffer (such as Wireshark) could give you some clues.

Cebjyre 2009-01-21 22:08:39
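For illustration, here is a minimal sketch of the two suggestions above: lexically scoped variables under strict and warnings, plus a warn of each URL before it is used. The build_url helper, the shortened month list, and the example.com base URL are assumptions made for this sketch, not the original poster's code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed placeholder base URL (example.com is reserved for examples).
my $urlBase = 'http://www.example.com/subheading/';
my @months  = qw/jan feb mar/;    # shortened list, just for the sketch

# Hypothetical helper: build the URL for one day.
sub build_url {
    my ($year, $month, $day) = @_;
    return "$urlBase$year/$month/$day.xml";
}

for my $month (@months) {
    my $newUrl = build_url(1987, $month, 1);
    # Goes to STDERR, so the last URL printed is the one that got stuck.
    warn "About to fetch: $newUrl\n";
}
```

With strict enabled, a misspelled variable such as $urlbase becomes a compile-time error instead of silently evaluating to an empty string.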

this worked, adding those "use" statements and throwing a "my" in front of everything and there we go. Like I said, something tiny I didn't know. Many thanks, and sorry I'm new to some conventions, will remember for the future.

gnomed 2009-01-21 22:19:11

@gnomed: example.com is more than a convention; it is reserved by RFC 2606.

J.F. Sebastian 2009-01-21 23:02:18

@gnomed: look into Date::Format and Date::Parse; you can collapse all your date loops into a single loop, and simultaneously avoid dates like '2005-02-31'.

kyle 2009-01-22 06:58:47
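A rough sketch of the single-loop idea from the comment above. Note the swap: instead of Date::Format and Date::Parse, this uses the core Time::Local module and the built-in gmtime so that it runs without extra installs. The date_parts helper name is hypothetical.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my @months = qw/jan feb mar apr may jun jul aug sep oct nov dec/;

# Hypothetical helper: return [year, month-name, day] for every real
# calendar day from 1 Jan of $start_year to 31 Dec of $end_year.
sub date_parts {
    my ($start_year, $end_year) = @_;
    my $t    = timegm(0, 0, 0, 1,  0,  $start_year);   # 1 Jan, 00:00 UTC
    my $stop = timegm(0, 0, 0, 31, 11, $end_year);     # 31 Dec, 00:00 UTC
    my @dates;
    # Step one day at a time; UTC has no daylight-saving jumps.
    for (; $t <= $stop; $t += 24 * 60 * 60) {
        my (undef, undef, undef, $day, $mon, $year) = gmtime($t);
        push @dates, [ $year + 1900, $months[$mon], $day ];
    }
    return @dates;
}

# One loop instead of three, and no impossible dates such as 31 Feb.
for my $date (date_parts(1987, 1987)) {
    my ($year, $month, $day) = @$date;
    # e.g. build "$urlBase$year/$month/$day.xml" here
}
```

Because only real dates are generated, the number of requests also drops compared with looping over 31 days for every month.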

Answer 3

+1 A:

(2006 - 1986) * 12 * 31 is more than 7000. Requesting that many web pages without a pause is not nice.

Slightly more Perl-like version (code-style wise):

#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple qw(get);    

my $urlBase = 'http://www.example.com/subheading/';
my @months  = qw/jan feb mar apr may jun jul aug sep oct nov dec/;
my $nullXML = <<'NULLXML';
<?xml version="1.0" encoding="UTF-8"?>
<nil-classes type="array"/>
NULLXML

for my $year (1987..2006) {
    for my $month (0..$#months) {
        for my $day (1..31) {
            my $newUrl = "$urlBase$year/$months[$month]/$day.xml";
            my $content = get($newUrl);
            next unless defined $content;   # get() returns undef on failure
            if ($content ne $nullXML) {
               my $filename = "$year-@{[$month+1]}-$day.xml";
               open my $fh, '>', $filename
                   or die "Can't open '$filename': $!";
               print $fh $content;
               # $fh implicitly closed
            }
        }
    }
}

J.F. Sebastian 2009-01-21 22:45:06

Quick heads-up: Perl subtly casts the min..max ranges to an array, then issues an iterator over it (at least, ActivePerl on Windows). Benchmark the behavior of min..max versus ( my $i = min; $i < max; ++$i ), and it is about 10x slower (as of my last test). Been slowly migrating all my scripts :)

kyle 2009-01-22 06:55:41

thanks for the tidy version, can't say I'm at that level yet, this is still my second day. but about the website requests, I changed my program since getting it to work to set pauses for slightly more reasonable request rates.

gnomed 2009-01-22 09:03:52
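One possible way to add the pauses mentioned above, as a sketch only. The fetch_politely helper and its arguments are hypothetical; the fetcher is passed in as a code reference so that the pacing logic can be tried without touching a real site.

```perl
use strict;
use warnings;

# Hypothetical helper: call $fetch on each URL, sleeping $delay seconds
# between requests so the remote server is not hammered.
sub fetch_politely {
    my ($fetch, $delay, @urls) = @_;
    my %content_for;
    for my $url (@urls) {
        $content_for{$url} = $fetch->($url);
        sleep($delay) if $delay > 0;    # built-in sleep, whole seconds
    }
    return \%content_for;
}
```

With LWP::Simple loaded, a call might look like `fetch_politely(\&get, 1, @urls)`, which waits one second after each request.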

@kyle: You are using an ancient Perl version. `for $i ($min..$max)` is faster and doesn't consume more memory than `for ($i=$min; $i<=$max; ++$i)`.

J.F. Sebastian 2009-01-22 11:17:39
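Rather than taking either claim on faith, the two loop styles can be timed with the core Benchmark module. This is a rough sketch; the loop bounds and sub names are arbitrary choices for illustration, and the numbers will vary with the Perl version and machine.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my ($min, $max) = (1, 100_000);

# Range-style loop.
sub sum_range {
    my $sum = 0;
    for my $i ($min .. $max) { $sum += $i }
    return $sum;
}

# C-style loop over the same bounds.
sub sum_cstyle {
    my $sum = 0;
    for (my $i = $min; $i <= $max; ++$i) { $sum += $i }
    return $sum;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    range  => \&sum_range,
    cstyle => \&sum_cstyle,
});
```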

Answer 4

A:

LWP::Simple has a getstore function that does most of the fetching and saving work for you. You might also check out LWP::Parallel::UserAgent for a bit more control over how you hit the remote site.

brian d foy 2009-01-22 01:35:10
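A minimal sketch of getstore, assuming LWP::Simple is installed. The fetch_to_file wrapper is a hypothetical name; getstore and is_success are the functions exported by LWP::Simple.

```perl
use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Hypothetical wrapper: fetch $url straight into $filename and report
# whether the HTTP status code indicates success.
sub fetch_to_file {
    my ($url, $filename) = @_;
    my $status = getstore($url, $filename);   # returns the HTTP status code
    return is_success($status);
}

# Possible usage (URL and filename are placeholders):
# fetch_to_file('http://www.example.com/subheading/1987/oct/20.xml',
#               '1987-10-20.xml')
#     or warn "fetch failed\n";
```

Note that getstore writes to disk before the content can be inspected, so a check against $nullXML would have to happen after the file is saved.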
