
So it seemed easy enough: use a series of nested loops to go through a ton of URLs sorted by year/month/day and download the XML files. As this is my first script, I started with the loops; something familiar in any language. I ran it just printing the constructed URLs and it worked perfectly. I then wrote the code to download the content and save it, separately, and that worked perfectly as well with a sample URL on multiple test cases. But when I combined these two bits of code, it broke: the program just got stuck and did nothing at all. So I ran the debugger, and as I stepped through it, it got stuck on this one line:

warnings::register::import(/usr/share/perl/5.10/warnings/register.pm:25):25:vec($warnings::Bits{$k}, $warnings::LAST_BIT, 1) = 0;

If I just hit r to return from the subroutine, it works and continues to another point on its way back down the call stack, where something similar happens over and over for some time. The stack trace:

$ = warnings::register::import('warnings::register') called from file `/usr/lib/perl/5.10/Socket.pm' line 7

$ = Socket::BEGIN() called from file `/usr/lib/perl/5.10/Socket.pm' line 7

$ = eval {...} called from file `/usr/lib/perl/5.10/Socket.pm' line 7

$ = require 'Socket.pm' called from file `/usr/lib/perl/5.10/IO/Socket.pm' line 12

$ = IO::Socket::BEGIN() called from file `/usr/lib/perl/5.10/Socket.pm' line 7

$ = eval {...} called from file `/usr/lib/perl/5.10/Socket.pm' line 7

$ = require 'IO/Socket.pm' called from file `/usr/share/perl5/LWP/Simple.pm' line 158

$ = LWP::Simple::_trivial_http_get('www.aDatabase.com', 80, '/sittings/1987/oct/20.xml') called from file `/usr/share/perl5/LWP/Simple.pm' line 136

$ = LWP::Simple::_get('http://www.aDatabase.com/1987/oct/20.xml') called from file `xmlfetch.pl' line 28

As you can see, it is getting stuck inside this get($url) method, and I have no clue why. Here is my code:

#!/usr/bin/perl

use LWP::Simple;

$urlBase = 'http://www.aDatabase.com/subheading/';
$day=1;
$month=1;
@months=("list of months","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec");
$year=1987;
$nullXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<nil-classes type=\"array\"/>\n";

while($year<=2006)
    {
    $month=1;
    while($month<=12)
     {
     $day=1;
     while($day<=31)
      {
      $newUrl = "$urlBase$year/$months[$month]/$day.xml";
      $content = get($newUrl);
      if($content ne $nullXML)
       {
       $filename = "$year-$month-$day.xml";
       open(FILE, ">$filename");
       print FILE $content;
       close(FILE);
       }
      $day++;
      }
     $month++;
     }
    $year++;
    }

I am almost positive it is something tiny I just don't know, but Google has not turned up anything.

Thanks in advance,

B.

EDIT: It's official: it just hangs inside this get method, sometimes for ages. It runs for several loops, then hangs again for a while. Either way it's still a problem. Why is this happening?

A: 

Hey B,

I have yet to use Perl, but at first glance I'm wondering if the exception is thrown as a result of a 404 error. I would imagine that the function would just return undef if the HTTP response was either a 404, 403, redirect, etc, but maybe that is not the case.

I might recommend using wget for this. Something like `wget $url` I think would work.

Anyway, as I said, I'm not a Perl programmer, but since the link you posted does in fact return a 404, that's my guess.

Let me know if you find that is the issue.

regex
It's not; I just decided to conceal the actual link. It works fine; like I said, I tested the get method and the saving separately. When there is no entry for a day on the site, it returns XML exactly like the value I check against with $nullXML. I will try wget and see if it works.
gnomed
wget works, but it's a little more finicky since it saves the file right away, and I don't want every file written to disk until it passes the check. I could check afterwards, but at the cost of more disk I/O. Thanks though, it is definitely useful for other cases.
gnomed
+3  A: 

Since http://www.adatabase.com/1987/oct/20.xml is a 404 (and isn't something that can be generated from your program anyway, as there is no 'subheading' in the path), I'm assuming that isn't the real link you are using, which makes it hard for us to test. As a general rule, please use example.com instead of making up hostnames; that's why it is reserved.

You should really

use strict;
use warnings;

in your code - this will help highlight any scoping issues you may have (I'd be surprised if that were the case, but there is a chance that a part of the LWP code is messing around with your $urlBase or something). I think it should be enough to put 'my' in front of the initial variable declarations (and of $newUrl, $content and $filename) to make your code run under strict.
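
For example, the top of the script would then look something like this (the loop logic stays the same; 'my' is just added where each variable first appears):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $urlBase = 'http://www.aDatabase.com/subheading/';
my @months  = ("list of months","jan","feb","mar","apr","may","jun","jul","aug","sep","oct","nov","dec");
my $year    = 1987;
my $nullXML = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<nil-classes type=\"array\"/>\n";
my ($month, $day);

# ...and inside the loops:
#   my $newUrl   = "$urlBase$year/$months[$month]/$day.xml";
#   my $content  = get($newUrl);
#   my $filename = "$year-$month-$day.xml";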

If using strict and warnings doesn't get you any closer to a solution, you could warn out the link you are about to use on each loop, so that when it sticks you can try that URL in a browser and see what happens; alternatively, a packet sniffer (such as Wireshark) could give you some clues.
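
For instance, one extra line just before the fetch (using the $newUrl variable from your code) is enough to show exactly where it sticks:

warn "about to fetch $newUrl\n";
my $content = get($newUrl);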

Cebjyre
This worked: adding those "use" statements and throwing a "my" in front of everything, and there we go. Like I said, something tiny I didn't know. Many thanks, and sorry, I'm new to some of the conventions; I will remember for the future.
gnomed
@gnomed: example.com is more than a convention; it is reserved in RFC 2606.
J.F. Sebastian
@gnomed: look into Date::Format and Date::Parse; you can collapse all your date loops into a single loop, and simultaneously avoid dates like '2005-02-31'.
kyle
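To illustrate kyle's suggestion, a rough sketch of the single-loop version (it assumes Date::Parse and Date::Format are installed, and that the site really uses lowercase month abbreviations and unpadded day numbers as in the question):

#!/usr/bin/perl
use strict;
use warnings;
use Date::Parse  qw(str2time);
use Date::Format qw(time2str);

my $urlBase = 'http://www.example.com/subheading/';
my $t   = str2time('1987-01-01', 'GMT');
my $end = str2time('2006-12-31', 'GMT');

while ($t <= $end) {
    # e.g. "1987/jan/1.xml" -- only real calendar dates are ever generated
    (my $path = lc time2str('%Y/%b/%e.xml', $t, 'GMT')) =~ s/ //g;
    my $newUrl = $urlBase . $path;
    # ... fetch and save $newUrl here ...
    $t += 24 * 60 * 60;    # step one day
}

Working in GMT keeps the one-day steps from drifting across daylight-saving changes.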
+1  A: 

(2006 - 1986) * 12 * 31 is more than 7000 requests. Requesting web pages without a pause is not nice; a short pause between fetches (see the note after the code) keeps the load on the site reasonable.

Slightly more Perl-like version (code-style wise):

#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple qw(get);    

my $urlBase = 'http://www.example.com/subheading/';
my @months  = qw/jan feb mar apr may jun jul aug sep oct nov dec/;
my $nullXML = <<'NULLXML';
<?xml version="1.0" encoding="UTF-8"?>
<nil-classes type="array"/>
NULLXML

for my $year (1987..2006) {
    for my $month (0..$#months) {
        for my $day (1..31) {
            my $newUrl = "$urlBase$year/$months[$month]/$day.xml";
            my $content = "abc"; # XXX stub for testing; use get($newUrl) for the real fetch
            if ($content ne $nullXML) {
               my $filename = "$year-@{[$month+1]}-$day.xml";
               open my $fh, ">$filename" 
                   or die "Can't open '$filename': $!";
               print $fh $content;
               # $fh implicitly closed
            }
        }
    }
}
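
For the pause mentioned above, the core sleep function is enough; for example, right after the real get($newUrl) call in the innermost loop:

            my $content = get($newUrl);
            sleep 1;    # be polite: at most one request per second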
J.F. Sebastian
Quick heads-up: Perl subtly expands min..max ranges into a list and then iterates over it (at least with ActivePerl on Windows). Benchmark min..max against ( my $i = min; $i < max; ++$i ) and the range form is about 10x slower (as of my last test). I've been slowly migrating all my scripts :)
kyle
Thanks for the tidy version; can't say I'm at that level yet, this is still my second day. But about the website requests: since getting it to work, I changed my program to pause for a slightly more reasonable request rate.
gnomed
@kyle: You are using an ancient Perl version. `for $i ($min..$max)` is faster and doesn't consume more memory than `for ($i=$min; $i<=$max; ++$i)`.
J.F. Sebastian
A: 

LWP has a getstore function that does most of the fetching-and-saving work for you. You might also check out LWP::Parallel::UserAgent for a bit more control over how you hit the remote site.
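
For example, using the variable names from the question (getstore returns the HTTP status code, which you can test with is_success, also provided by LWP::Simple):

use LWP::Simple qw(getstore is_success);

my $status = getstore($newUrl, "$year-$month-$day.xml");
warn "fetch of $newUrl failed with status $status\n" unless is_success($status);

Note that getstore writes the file before you get a chance to look at the content, so the $nullXML comparison would have to happen after the fact.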

brian d foy