Hi,

I want to search the contents of the files in one directory for words listed in the files of another directory. Is there a better way to do it than the following? (By "better" I mean in terms of memory usage.)

More specifically:

Folder 1 has several files, and each file has several lines of text. Folder 2 has several files, and each file has several words, each on its own line. For each line of each file in folder 1, I want to count the number of occurrences of each word from each file in folder 2. I hope that wasn't too confusing.

open my $output, '>>', 'D:/output.txt' or die "Can't open output file: $!";

my @files = <"folder1/*">;
my @categories = <"folder2/*">;
foreach my $file (@files){
    open my $fileh, '<', $file or die "Can't open file $file: $!";
    foreach my $line (<$fileh>){
        foreach my $categoryName (@categories){
            open my $categoryFile, '<', $categoryName or die "Can't open file $categoryName: $!";
            foreach my $word(<$categoryFile>){
                #search using regex
            }
            #print to output
        }
    }
}
+1  A: 

One obvious improvement is to open all the category files first in a separate loop and cache the words in them into a hash of arrays (hash key being the filename), or just one big array if you don't care which search word came from which file.

This will avoid having to re-read the search files for every line in every $file - AND help get rid of duplicate search words in the bargain.

use File::Slurp;
open my $output, '>>', 'D:/output.txt' or die "Can't open output file: $!";

my %categories = ();
my @files = <"folder1/*">;
my @categoryFiles = <"folder2/*">;
foreach my $categoryName (@categoryFiles) {
    my @lines = read_file($categoryName);
    foreach my $category (@lines) {
        chomp($category);
        $categories{$category} = 0;
    }
}
# the keys of %categories are unique, so duplicate search words are already gone

foreach my $file (@files) {
    open my $fileh, '<', $file or die "Can't open file $file: $!";
    while (my $line = <$fileh>) {   # reads one line at a time instead of the whole file
        foreach my $category (keys %categories) {
            # count
        }
    }
    # output
}
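
If you do care which search word came from which file, the hash-of-arrays variant mentioned above could look roughly like this. It is only a sketch: load_categories is a name I made up for illustration, and it assumes one word per line, as described in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): reads each category file once
# and returns a hash ref of the form { filename => [unique words] }.
sub load_categories {
    my @category_files = @_;
    my %words_by_file;
    for my $name (@category_files) {
        open my $fh, '<', $name or die "Can't open file $name: $!";
        my %seen;
        while (my $word = <$fh>) {
            chomp $word;
            next if $word eq '' or $seen{$word}++;   # skip blank lines and duplicates
            push @{ $words_by_file{$name} }, $word;
        }
        close $fh;
    }
    return \%words_by_file;
}
```

You would call it once before the main loop, e.g. my $categories = load_categories(glob 'folder2/*'); and then loop over @{ $categories->{$name} } for each category file.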

Also, if these are real "words" - meaning a category of "cat" needs to match "cat dog" but not "mcat" - I would count the word usage by splitting instead of a regex:

while (my $line = <$fileh>) {
    my @words = split(' ', $line);   # ' ' also skips leading whitespace
    foreach my $word (@words) {
        $categories{$word}++ if exists $categories{$word};
    }
}
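
Since the question asks for counts per line, here is a minimal sketch of the same splitting idea wrapped in a sub that returns the counts for a single line. The name count_in_line and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: counts how often each wanted word occurs in one line.
# $wanted is a hash ref whose keys are the search words.
sub count_in_line {
    my ($line, $wanted) = @_;
    my %count;
    for my $word (split ' ', $line) {
        $count{$word}++ if exists $wanted->{$word};
    }
    return \%count;
}

my %wanted = map { $_ => 1 } qw(cat dog);
my $counts = count_in_line("cat dog mcat cat\n", \%wanted);
print "$_: $counts->{$_}\n" for sort keys %$counts;
# prints "cat: 2" then "dog: 1"; "mcat" is not counted
```

Note that splitting on whitespace leaves punctuation attached, so "cat," would not match "cat". If that matters for your data, strip punctuation from each word before the lookup.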
DVK