tags:

views:

109

answers:

3

Hi, I'm writing this Perl script that gets two command line arguments: a directory and a year. In this directory is a ton of text files or html files(depending on the year). Lets say for instance it's the year 2010 which contains files that look like this <number>rank.html with the number ranging from 2001 to 2212. I want it to open each file individually and take a part of the title in the html file and print it to a text file. However, when I run my code it just prints the first files title to the text file. It seems that it only ever opens the first file 2001rank.html and no others. I'll post the code below and thanks to anyone that helps.

my $directory = shift or "Must supply directory\n";
my $year = shift or "Must supply year\n";

unless (-d $directory) {
  die "Error: Directory must be a directory\n";
}

unless ($directory =~ m/\/$/) {
  $directory = "$directory/";
}

open COLUMNS, "> columns$year.txt" or die "Can't open columns file";
my $column_name;

for (my $i = 2001; $i <= 2212; $i++) {

  if ($year >= 2009) {
    my $html_file = $directory.$i."rank.html";
    open FILE, $html_file;

    #check if opened correctly, if not, skip it
    unless (defined fileno(FILE)) {
      print "skipping $html_file\n";
      next;
    }

    $/ = "\n";
    my $line = <FILE>;

    if (defined $line) {
      $column_name = "";
      $_ = <FILE> until m{</title>};
      $_ =~ m{<title>CIA - The World Factbook -- Country Comparison :: (.+)</title>}i;
      $column_name = $1;
    }
    else {
      close FILE;
      next;
    }
    close FILE;
  }
  else {
    my $text_file = $directory.$i."rank.txt";
    open FILE, $text_file;

    unless (defined fileno(FILE)) {
      print "skipping $text_file\n";
      next;
    }

    $/ = "\r";
    my $line = <FILE>;

    if (defined $line) {
      $column_name = "";
      $_ = <FILE> until /Rank/i;
      $_ =~ /Rank(\s+)Country(\s+)(.+)(\s+)Date/i;
      $column_name = $3;
    }
    else {
      close FILE;
      next;
    }
    close FILE;
  }

  print "Adding $column_name to text file\n";
  print COLUMNS "$column_name\n";
}

close COLUMNS;

In other words $column_name gets set equal to the same thing every pass in the loop, even though I know the html files are different.

+5  A: 

You'll probably be able to debug this a lot faster if you convert using local lexicals for your filehandles instead of globals, as well as turn on strict checking:

use strict;
use warnings;

while (...)
{
    # ...
    open my $filehandle, $html_file;

    # ...
    my $line = <$filehandle>;
}

This way, the filehandle(s) will go out of scope during each loop iteration, so you can more clearly see what exactly is being referenced and where. (Hint: you may have missed a condition where the filehandle gets closed, so it is improperly reused the next time around.)

For more on best practices with open and filehandles, see:

Some other points:

  • Don't ever explicitly assign to $_, that's asking for trouble. Declare your own variable to hold your data: my $line = <$filehandle> (as in the example above)
  • Pull out your matches directly into variables, rather than using $1, $2 etc, and only use parentheses for the portions you actually need: my ($column_name) = ($line =~ m/Rank\s+Country\s+.+(\s+)Date/i);
  • put the error conditions first, so the bulk of your code can be outdented one (or more) level(s). This will improve readability, as when the bulk of your algorithm is visible on the screen at once, you can better visualize what it is doing and catch errors.

If you apply the points above I'm pretty sure that you'll spot your error. I spotted it while making this last edit, but I think you'll learn more if you discover it yourself. (I'm not trying to be snooty; trust me on this!)

Ether
Thanks for the suggestion, I tried it, but it still isn't working right. Is there a way I can see what file a file handle is using?
Silmaril89
@Silmaril89: see my last edit.
Ether
So, how would I get the match directly into $column_name if the line is in $line?
Silmaril89
@Silmaril89: with the `=~` operator: see http://perldoc.perl.org/perlop.html#Binding-Operators (or http://perldoc.perl.org/perlre.html)
Ether
I followed your suggestions and I honestly am not sure what is different about my new code that made it work, but it does now. So, thank you.
Silmaril89
@Silmaril89: woohoo! At first, I thought the issue might be the way you were opening and closing filehandles (and so I constructed my original response based around that), but I think the real issue was in these lines: `$_ =~ /Rank(\s+)Country(\s+)(.+)(\s+)Date/i; $column_name = $3;` -- you were grabbing the third match into $column_name, but there were four sets of matching parentheses (either you miscounted, or perhaps thought the matches counted up from 0?)
Ether
What it really was, is that I didn't know what the parentheses meant to perl as far as regular expressions were concerned, but now I realize what they do, they are what you actually want matched or returned from the regular expression. And yes I think it had something to do with that. Thanks again.
Silmaril89
A: 

Have you considered grep?

grep out just the line from the HTML containing the title, and then process the output of grep.

Simpler, as you would not have to write any file-handling code. You didn't say what you want with that title - if you only need a list, you might not need to write any code at all.

Try something like:

grep -ri title <directoryname>
Bandi-T
+2  A: 

Your processing is similar for HTML and text files, so make your life easy and factor out the common part:

sub scrape {
  my($path,$pattern,$sep) = @_;

  unless (open FILE, $path) {
    warn "$0: skipping $path: $!\n";
    return;
  }

  local $/ = $sep;

  my $column_name;
  while (<FILE>) {
    next unless /$pattern/;
    $column_name = $1;
    last;
  }

  close FILE;

  ($path,$column_name);
}

Then make it specific for the two types of input:

sub scrape_html {
  my($directory,$i) = @_;

  scrape $directory.$i."rank.html", 
         qr{<title>CIA - The World Factbook -- Country Comparison :: (.+)</title>}i,
         "\n";
}

sub scrape_txt {
  my($directory,$i) = @_;

  scrape $directory.$i."rank.txt",
         qr/Rank\s+Country\s+(.+)\s+Date/i,
         "\r";
}

Then your main program is straightforward:

my $directory = shift or die "$0: must supply directory\n";
my $year      = shift or die "$0: must supply year\n";

die "$0: $directory is not a directory\n"
  unless -d $directory;

# add trailing slash if necessary
$directory =~ s{([^/])$}{$1/};

my $columns_file = "columns$year.txt";
open COLUMNS, ">", $columns_file
  or die "$0: open $columns_file: $!";

for (my $i = 2001; $i <= 2212; $i++) {
  my $process = $year >= 2009 ? \&scrape_html : \&scrape_txt;

  my($path,$column_name) = $process->($directory,$i);

  next unless defined $path;

  if (defined $column_name) {
    print "$0: Adding $column_name to text file\n";
    print COLUMNS "$column_name\n";
  }
  else {
    warn "$0: no column name in $path\n";
  }
}

close COLUMNS or warn "$0: close $columns_file: $!\n";

Note how careful you have to be to close global filehandles. Please use lexical filehandles as in

open my $fh, $path or die "$0: open $path: $!";

Passing $fh as a parameter or stuffing it in hashes is much nicer. Also, lexical filehandles close automatically when they go out of scope. There's no chance of stomping on a handle someone else is already using.

Greg Bacon