ansaurus

Question

Why does my Perl script keep reading from same file, even though I closed it?

Answer 1

+5 A:

You'll probably be able to debug this a lot faster if you convert using local lexicals for your filehandles instead of globals, as well as turn on strict checking:

use strict;
use warnings;

while (...)
{
    # ...
    open my $filehandle, $html_file;

    # ...
    my $line = <$filehandle>;
}

This way, the filehandle(s) will go out of scope during each loop iteration, so you can more clearly see what exactly is being referenced and where. (Hint: you may have missed a condition where the filehandle gets closed, so it is improperly reused the next time around.)

For more on best practices with open and filehandles, see:

Some other points:

Don't ever explicitly assign to $_, that's asking for trouble. Declare your own variable to hold your data: my $line = <$filehandle> (as in the example above)
Pull out your matches directly into variables, rather than using $1, $2 etc, and only use parentheses for the portions you actually need: my ($column_name) = ($line =~ m/Rank\s+Country\s+.+(\s+)Date/i);
put the error conditions first, so the bulk of your code can be outdented one (or more) level(s). This will improve readability, as when the bulk of your algorithm is visible on the screen at once, you can better visualize what it is doing and catch errors.

If you apply the points above I'm pretty sure that you'll spot your error. I spotted it while making this last edit, but I think you'll learn more if you discover it yourself. (I'm not trying to be snooty; trust me on this!)

Ether 2010-01-20 20:28:19

Thanks for the suggestion, I tried it, but it still isn't working right. Is there a way I can see what file a file handle is using?

Silmaril89 2010-01-20 21:24:04

@Silmaril89: see my last edit.

Ether 2010-01-20 21:27:03

So, how would I get the match directly into $column_name if the line is in $line?

Silmaril89 2010-01-20 21:47:09

@Silmaril89: with the `=~` operator: see http://perldoc.perl.org/perlop.html#Binding-Operators (or http://perldoc.perl.org/perlre.html)

Ether 2010-01-20 21:53:06

I followed your suggestions and I honestly am not sure what is different about my new code that made it work, but it does now. So, thank you.

Silmaril89 2010-01-20 23:50:39

@Silmaril89: woohoo! At first, I thought the issue might be the way you were opening and closing filehandles (and so I constructed my original response based around that), but I think the real issue was in these lines: `$_ =~ /Rank(\s+)Country(\s+)(.+)(\s+)Date/i; $column_name = $3;` -- you were grabbing the third match into $column_name, but there were four sets of matching parentheses (either you miscounted, or perhaps thought the matches counted up from 0?)

Ether 2010-01-20 23:55:33

What it really was, is that I didn't know what the parentheses meant to perl as far as regular expressions were concerned, but now I realize what they do, they are what you actually want matched or returned from the regular expression. And yes I think it had something to do with that. Thanks again.

Silmaril89 2010-01-21 01:25:33

Answer 2

A:

Have you considered grep?

grep out just the line from the HTML containing the title, and then process the output of grep.

Simpler, as you would not have to write any file-handling code. You didn't say what you want with that title - if you only need a list, you might not need to write any code at all.

Try something like:

grep -ri title <directoryname>

Bandi-T 2010-01-20 20:37:07

Answer 3

+2 A:

Your processing is similar for HTML and text files, so make your life easy and factor out the common part:

sub scrape {
  my($path,$pattern,$sep) = @_;

  unless (open FILE, $path) {
    warn "$0: skipping $path: $!\n";
    return;
  }

  local $/ = $sep;

  my $column_name;
  while (<FILE>) {
    next unless /$pattern/;
    $column_name = $1;
    last;
  }

  close FILE;

  ($path,$column_name);
}

Then make it specific for the two types of input:

sub scrape_html {
  my($directory,$i) = @_;

  scrape $directory.$i."rank.html", 
         qr{<title>CIA - The World Factbook -- Country Comparison :: (.+)</title>}i,
         "\n";
}

sub scrape_txt {
  my($directory,$i) = @_;

  scrape $directory.$i."rank.txt",
         qr/Rank\s+Country\s+(.+)\s+Date/i,
         "\r";
}

Then your main program is straightforward:

my $directory = shift or die "$0: must supply directory\n";
my $year      = shift or die "$0: must supply year\n";

die "$0: $directory is not a directory\n"
  unless -d $directory;

# add trailing slash if necessary
$directory =~ s{([^/])$}{$1/};

my $columns_file = "columns$year.txt";
open COLUMNS, ">", $columns_file
  or die "$0: open $columns_file: $!";

for (my $i = 2001; $i <= 2212; $i++) {
  my $process = $year >= 2009 ? \&scrape_html : \&scrape_txt;

  my($path,$column_name) = $process->($directory,$i);

  next unless defined $path;

  if (defined $column_name) {
    print "$0: Adding $column_name to text file\n";
    print COLUMNS "$column_name\n";
  }
  else {
    warn "$0: no column name in $path\n";
  }
}

close COLUMNS or warn "$0: close $columns_file: $!\n";

Note how careful you have to be to close global filehandles. Please use lexical filehandles as in

open my $fh, $path or die "$0: open $path: $!";

Passing $fh as a parameter or stuffing it in hashes is much nicer. Also, lexical filehandles close automatically when they go out of scope. There's no chance of stomping on a handle someone else is already using.

Greg Bacon 2010-01-20 23:17:38

ansaurus

tags:

views:

answers:

Why does my Perl script keep reading from same file, even though I closed it?

related questions