I have recently started learning Perl, and one of my latest assignments involves searching a bunch of files for a particular string. The user provides the directory name as an argument and the program searches all the files in that directory for the pattern. Using readdir() I have managed to build an array with all the searchable file names, and now I need to search each and every file for the pattern. My implementation looks something like this:

sub searchDir($) {
    my $dirN = shift;
    my @dirList = glob("$dirN/*");
    for (@dirList) {
        push @fileList, $_ if -f $_;
    }
    @ARGV = @fileList;
    while(<>) {
        ## Search for pattern
    }
}

My question is: is it all right to manually load the @ARGV array as has been done above and use the <> operator to scan individual lines, or should I open/scan/close each file individually? Will it make any difference if this processing exists in a subroutine and not in the main function?

+1  A: 

Yes, it is OK to adjust the argument list before you start the 'while (<>)' loop; it would be rather more foolhardy to adjust it while inside the loop. If you process option arguments, for instance, you typically remove items from @ARGV; here, you are adding items, but it still changes the original value of @ARGV.
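
For example, option handling ahead of the loop often looks something like this rough sketch (the -i flag here is invented purely for illustration):

my $ignore_case = 0;
if (@ARGV and $ARGV[0] eq '-i') {
    shift @ARGV;        # consume the option so <> does not try to open it as a file
    $ignore_case = 1;
}
while (<>) {
    # search the remaining files named in @ARGV
}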

It makes no odds whether the code is in a subroutine or in the 'main function'.

Jonathan Leffler
+3  A: 

I would prefer this more explicit and readable version:

#!/usr/bin/perl -w 

foreach my $file (<$ARGV[0]/*>){
    open(F, $file) or die "$!: $file";
    while(<F>){
      # search for pattern
    }
    close F;
}

But it is also okay to manipulate @ARGV:

#!/usr/bin/perl -w 

@ARGV = <$ARGV[0]/*>;
while(<>){
    # search for pattern
}
To the original poster: note the use of globbing (<*>) rather than readdir().
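
For comparison, a readdir()-based version is a little more work, since readdir() returns bare names (including "." and "..") rather than paths, so you have to prepend the directory and filter yourself. A rough, untested sketch:

opendir(my $dh, $ARGV[0]) or die "Cannot open directory $ARGV[0]: $!";
my @files = grep { -f } map { "$ARGV[0]/$_" } readdir($dh);   # -f also weeds out . and ..
closedir($dh);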
nimrodm
+7  A: 

On the topic of manipulating @ARGV - that's definitely working code; Perl certainly allows you to do that. I don't think it's a good coding habit, though. Most of the code I've seen that uses the "while (<>)" idiom is using it to read from standard input, and that's what I initially expected your code to do. A more readable pattern might be to open/close each input file individually:

foreach my $file (@files) {
    open FILE, "<$file" or die "Error opening file $file ($!)";
    my @lines = <FILE>;
    close FILE or die $!;

    foreach my $line (@lines) {
        if ( $line =~ /$pattern/ ) {
            # do something here!
        }
    }
}

That would read more easily to me, although it is a few more lines of code. Perl allows you a lot of flexibility, but I think that makes it that much more important to develop your own style in Perl that's readable and understandable to you (and your co-workers, if that's important for your code/career).

Putting the code in the main program or in a subroutine is also mostly a stylistic decision that you should play around with and think about. Modern computers are so fast at this stuff that style and readability are much more important for scripts like this, as you're not likely to encounter situations in which such a script over-taxes your hardware.

Good luck! Perl is fun. :)

Edit: It's of course true that if he had a very large file, he should do something smarter than slurping the entire file into an array. In that case, something like this would definitely be better:

while ( my $line = <FILE> ) {
    if ( $line =~ /$pattern/ ) {
        # do something here!
    }
}

The point where I wrote "you're not likely to encounter situations in which such a script over-taxes your hardware" was meant to cover that; sorry for not being more specific. Besides, who even has 4GB hard drives, let alone 4GB files? :P

Another Edit: After perusing the Internet on the advice of commenters, I've realized that there are hard drives that are much larger than 4GB available for purchase. I thank the commenters for pointing this out, and promise in the future to never-ever-ever try to write a sarcastic comment on the internet.

James Thompson
It's not necessary to store the whole file in memory first and go through it in a separate loop. Imagine he had a 4GB file, for example!
"Who even has 4GB hard drives?" - what decade are you in? My hard drive is 150GB. Perhaps you meant 4GB of memory, which is slightly more believable. But my computer has 2GB of memory. So what exactly are you talking about?
Chris Lutz
+1  A: 

The previous answers cover your main Perl-programming question rather well.

So let me comment on the underlying question: How to find a pattern in a bunch of files.

Depending on the OS, it might make sense to call a specialised external program, say

grep -l <pattern> <path>

on Unix.

Depending on what you need to do with the files containing the pattern, and how big the hit/miss ratio is, this might save quite a bit of time (and re-uses proven code).
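
If you do go that route, a minimal sketch of driving it from Perl might look like this (here $dir and $pattern are just placeholders for whatever the user supplied):

my @candidates = grep { -f } glob("$dir/*");
die "No files to search\n" unless @candidates;   # otherwise grep would wait on STDIN
# -l asks grep to print only the names of files that contain a match
open my $grep, '-|', 'grep', '-l', '-e', $pattern, @candidates
    or die "Cannot run grep: $!";
chomp(my @matching_files = <$grep>);
close $grep;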

lexu
Thanks a lot for the suggestion. I was actually trying out a basic port of egrep in Perl in order to work on my file I/O and regular expressions.
muteW
It's always a balance between keeping it in one place (e.g. an egrep port to Perl: all Perl, self-maintainable, possibly portable, but containing home-made errors) vs. using the tools provided by the OS (a quick solution, tested and optimized, but potentially non-portable). It depends on the situation!
lexu
That's true. I would any day opt for the OS tools to get the job done, but in this case I was trying to use my Perl file I/O knowledge to strengthen my learning (I picked up a Perl book a week or so ago).
muteW
A: 

The big issue with tweaking @ARGV is that it is a global variable. Also, you should be aware that while (<>) has special magic attributes (it reads each file in @ARGV, or STDIN if @ARGV is empty, and it tests for definedness rather than truth). To reduce the magic that needs to be understood, I would avoid it, except for quickie hack jobs.
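
Roughly speaking, the magic loop behaves like this sketch (adapted from the description in perlop; the real thing has a few more wrinkles):

unshift(@ARGV, '-') unless @ARGV;    # no arguments means "read STDIN"
while ($ARGV = shift @ARGV) {
    open(ARGV, $ARGV) or do { warn "Can't open $ARGV: $!"; next };
    while (defined($_ = <ARGV>)) {   # tests definedness, not truth
        # ... loop body ...
    }
}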

You can get the filename of the current file by checking $ARGV.

You may not realize it, but you are actually affecting two global variables, not just @ARGV. You are also hitting $_. It is a very, very good idea to localize $_ as well.

You can reduce the impact of munging globals by using local to localize the changes.

BTW, there is another important, subtle bit of magic with <>. Say you want to return the line number of the match in the file. You might think, OK, check perlvar and find that $. gives the line number in the last handle accessed--great. But there is an issue lurking here--$. is not reset between @ARGV files. This is great if you want to know how many lines total you have processed, but not if you want a line number for the current file. Fortunately there is a simple trick with eof that will solve this problem.

use strict;
use warnings;

...

searchDir( 'foo', qr/some pattern/ );  # directory plus a placeholder pattern

sub searchDir {
    my $dirN    = shift;
    my $pattern = shift;

    local $_;

    my @fileList = grep { -f $_ } glob("$dirN/*");

    return unless @fileList;  # Don't want to process STDIN.

    local @ARGV;

    @ARGV = @fileList;
    while(<>) {
        my $found = 0;
        ## Search for pattern
        if ( $found ) {
            print "Match at $. in $ARGV\n";
        }
    }
    continue {
        # reset line numbering after each file.
        close ARGV  if eof;  # don't use eof().
    }
}

WARNING: I just modified your code in my browser. I have not run it, so it may have typos and probably won't work without a bit of tweaking.

Update: The reason to use local instead of my is that they do very different things. my creates a new lexical variable that is only visible in the contained block and cannot be accessed through the symbol table. local saves the existing package variable and aliases it to a new variable. The new localized version is visible in any subsequent code, until we leave the enclosing block. See perlsub: Temporary Values Via local().

In the general case of making new variables and using them, my is the correct choice. local is appropriate when you are working with globals, but you want to make sure you don't propagate your changes to the rest of the program.

This short script demonstrates local:

$foo = 'foo';

print_foo();    # prints "Foo: foo"
print_bar();    # prints "Foo: bar" (the localized value is visible inside print_foo)
print_foo();    # prints "Foo: foo" again, because the original value is restored

sub print_bar {
    local $foo;     # save the global $foo; it is restored when print_bar returns
    $foo = 'bar';
    print_foo();
}

sub print_foo {
    print "Foo: $foo\n";
}
daotoad
Any reason why you would use 'local' for the $_ and @ARGV variables instead of lexically scoping them with 'my'?
muteW
Thank you. The perlsub document was most helpful in explaining the need to use local when handling special global and punctuation variables.
muteW