ansaurus

Question

How can I take queries from one file, search another, and output to a third, in Perl?

Answer 1

+1 A:

Are FILE1 and FILE2 initially sorted? If so, you only need a single loop, not a nested one:

use 5.010;
use warnings;
use strict;

my $dictFile = 'c:/FILE2.txt';
my $wordsFile = 'c:/FILE1.txt';
my $outFile = 'c:/FILE3.txt';

open my $dic, '<', $dictFile or die "Cannot open $dictFile: $!";
open my $filter, '<', $wordsFile or die "Cannot open $wordsFile: $!";
open my $learn, '>', $outFile or die "Cannot open $outFile: $!";

my $dic_line;
my $dic_word;
my $filter_word;

# loop forever (or until last'ing out of the loop, anyway)
while (1) {
    # if we don't have a word from the filter list, get one
    if ( ! defined $filter_word ) {
        # get a line from the filter file, bailing out of the loop if at the end
        $filter_word = <$filter> // last;
        # remove the newline so we can string compare
        chomp($filter_word);
    }
    # if we don't have a word from the dictionary, get one
    if ( ! defined $dic_line ) {
        # get a line from the dictionary, bailing out of the loop if at the end
        $dic_line = <$dic> // last;
        # get the first word on the line
        ($dic_word) = split ' ', $dic_line;
    }
    # if we have a match, print it
    if ( $dic_word eq $filter_word ) { print $learn $dic_line }
    # only keep considering this dictionary line if it is beyond the filter word we had
    if ( lc $dic_word le lc $filter_word ) { undef $dic_line }
    # only keep considering this filter word if it is beyond the dictionary line we had
    if ( lc $dic_word ge lc $filter_word ) { undef $filter_word }
}
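For readers who have not met the `//` (defined-or) operator used in the loop above, here is a minimal sketch. The operator was introduced in Perl 5.10. The sample values and the in-memory "file" are made up for illustration and are not part of the original answer.

```perl
use 5.010;
use strict;
use warnings;

# $x // $y yields $x if $x is defined, otherwise $y.
# Unlike ||, it accepts 0 and '' as perfectly good values.
my $undefined;
my @results = (
    $undefined // 'fallback',   # left side is undef, so the right side is used
    0          // 'fallback',   # 0 is defined, so it is kept
    ''         // 'fallback',   # the empty string is defined, so it is kept
    0          || 'fallback',   # || tests truth, so 0 is replaced
);

# The same idiom as in the loop above: read lines from an in-memory
# "file" and leave the loop as soon as the read returns undef.
open my $fh, '<', \"apple\nbanana\n" or die "Cannot open in-memory file: $!";
my @lines;
while (1) {
    my $line = <$fh> // last;
    chomp $line;
    push @lines, $line;
}
say "read: @lines";
```

The `last` on the right-hand side only runs when the read returns undef, which is what happens at end of file.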

ysth 2009-10-14 05:29:14

@ysth, yes, in my example, they are initially sorted. I invented the problem to solve so that I could improve my Perl knowledge. But I don't know how to avoid a nested loop. Can I?

Mike 2009-10-14 06:13:59

@ysth, thanks. But running the code gives me "uninitialized in concatenation" error.

Mike 2009-10-14 10:19:50

@Mike: on what line? I left out the setup stuff; will add that in case you did something differently.

ysth 2009-10-14 15:45:08

@Mike: I have the feeling you have mistyped something; there is no concatenation there.

ysth 2009-10-14 15:49:35

@ysth, I had added the setup stuff before I ran the code. But I think I'm now beginning to see where the problem is:) It's probably not that I mistyped something. Running the current code is giving me the same "uninitialized value in concatenation or string at Line 16" error. The version of my activeperl is 5.8.8. Gotta upgrade it to 5.010.

Mike 2009-10-15 00:48:02

@ysth, I've now upgraded ActivePerl to 5.10.1. The code does its job perfectly :) It's neat code, much better than my first attempt. But could you please add comments starting from the while(1) line, if possible? I think I need some explanation to understand the code. Some things, such as the // operator, are new to me.

Mike 2009-10-15 04:17:42

@ysth, I'm reading the comments. Thanks a lot!!

Mike 2009-10-15 12:41:22

Thanks. It's great to have the comments.

Mike 2009-10-15 13:18:39

Answer 2

+3 A:

Have you gotten to the part of Learning Perl where you learn about hashes? You could load the contents of FILE1 into a hash and then check whether the first word of each line in FILE2 was in your hash table.
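A minimal sketch of that idea, using in-memory strings in place of FILE1 and FILE2 so that it runs as-is. The sample words and definitions, and the sub name `filter_dictionary`, are assumptions made for illustration and do not come from the original question.

```perl
use strict;
use warnings;

# Return the dictionary lines whose first word appears in the word list.
# Both arguments are open filehandles (hypothetical helper for this sketch).
sub filter_dictionary {
    my ($words_fh, $dict_fh) = @_;

    # load every wanted word into a hash for fast lookups
    my %wanted;
    while ( my $word = <$words_fh> ) {
        chomp $word;
        $wanted{$word} = 1;
    }

    # keep each dictionary line whose first word is a key in the hash
    my @matches;
    while ( my $line = <$dict_fh> ) {
        my ($first) = split ' ', $line;
        push @matches, $line if defined $first && exists $wanted{$first};
    }
    return @matches;
}

# stand-ins for FILE1 (words) and FILE2 (dictionary)
my $file1 = "azure\neyrie\n";
my $file2 = "azure adj. bright blue\nbanana n. a fruit\neyrie n. eagle's nest\n";

open my $words_fh, '<', \$file1 or die "Cannot open word list: $!";
open my $dict_fh,  '<', \$file2 or die "Cannot open dictionary: $!";

my @found = filter_dictionary($words_fh, $dict_fh);
print @found;
```

With real files, the two `open` calls would take file names instead of references to strings. Neither file needs to be sorted for this approach.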

mobrule 2009-10-14 05:29:36

This seems like the most natural solution to me. Replace the bit where you slurp into @filter with a block that reads each line of f1, setting $filter{$_} = 1, or somesuch, then later check candidates against that hash, like if($filter{$candidate} == 1) { print LEARN $candidate; }.

fatcat1111 2009-10-14 05:41:24

@mobrule, thanks for the pointer. This approach seems much better. Actually I've already roughly browsed through some pages about hashes, but don't quite grasp the concept yet. I'll try a little harder.

Mike 2009-10-14 06:04:33

Answer 3

+8 A:

You're making the problem harder than it needs to be by thinking about all of it at once rather than breaking it down into manageable bits.

It doesn't look like you need regexes here. You just need to see if the term in the first column was in the list:

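# open the list of terms to look for
open my($patterns), '<', 'patterns.txt' or die "Could not get patterns: $!";

# build a lookup hash: each chomped line becomes a key with the value 1
my %hash = map { my $p = $_; chomp $p; $p, 1 } <$patterns>;

open my($lines), '<', 'file.txt' or die "Could not open file.txt: $!";

while ( <$lines> ) {
    # split into at most two fields; the first one is the term
    my( $term ) = split /\s+/, $_, 2;
    # print the whole line if its term is one we are looking for
    print if exists $hash{$term};
}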

If you really needed regular expressions to find the terms, you might be able to get away with just grep:

 grep -f patterns.txt file.txt
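To see what the `map` and `split` lines in the Perl version above produce, here is a small sketch on made-up data. The array `@pattern_lines` and the sample dictionary line are assumptions standing in for the contents of patterns.txt and file.txt.

```perl
use strict;
use warnings;

# made-up lines, as they would come from <$patterns>
my @pattern_lines = ("azure\n", "eyrie\n");

# map returns a flat list (key, 1, key, 1, ...), which becomes the hash
my %hash = map { my $p = $_; chomp $p; $p, 1 } @pattern_lines;

# a limit of 2 means split cuts only at the first run of whitespace
my $line = "azure   adj. bright blue, as of the sky\n";
my ($term, $rest) = split /\s+/, $line, 2;

print join(', ', sort keys %hash), "\n";
print "term=[$term]\n";
print "rest=[$rest]";
```

The second field keeps the rest of the line, trailing newline included, which is why the answer's code only needs the first element.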

brian d foy 2009-10-14 05:49:33

Well, I guess this is something about hashes, which I don't quite grasp yet. But thanks a lot, brian! I also want to thank you again for the solution to my previous post :) I'm saving the code for later study. Gotta read the hashes section in Learning Perl carefully.

Mike 2009-10-14 07:02:43

@Mike, if there's a chance that the item being looked for isn't in the file of definitions, you can delete the items from the hash as they're found, and then report on what's still left in the hash after the search.
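A rough sketch of that suggestion, with made-up data in place of the two files. The names `%wanted`, `@dictionary`, `@found`, and `@missing` are chosen for illustration only.

```perl
use strict;
use warnings;

# words we want, and a dictionary that lacks one of them (made-up data)
my %wanted = map { $_ => 1 } qw(azure eyrie zephyr);
my @dictionary = (
    "azure adj. bright blue\n",
    "banana n. a fruit\n",
    "eyrie n. eagle's nest\n",
);

my @found;
for my $line (@dictionary) {
    my ($term) = split ' ', $line;
    # delete returns the removed value, so it is true only for wanted words
    push @found, $line if delete $wanted{$term};
}

# whatever is still in the hash never turned up in the dictionary
my @missing = sort keys %wanted;
print @found;
print "not found: @missing\n";
```

One consequence of deleting as you go: if the dictionary lists the same term twice, only the first entry is kept.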

Joe 2009-10-14 17:47:11

@brian, I'm now trying to understand your code, since I have now learnt something about hashes. Your code does the job like magic, but there is a lot in it that I don't follow. I was wondering if you could possibly provide some explanation of each line. I've also upgraded my code, but I've found that the pattern I used in the map function is crippled. It won't work with a different definition file.

Mike 2009-10-15 04:33:20

Having taken another look at your code, I think it is very much like the upgraded code of mine. I don't quite understand the functions of "my $p = $_; chomp $p; $p, 1" and of "split /\s+/, $_, 2;".

Mike 2009-10-15 04:43:00

@brian, I figured out by myself what "map { my $p = $_; chomp $p; $p, 1 } <$patterns>;" does. This line changes each line of "patterns.txt" to look something like word1 => 1 word2=>1 word3=>1----

Mike 2009-10-15 07:49:01

@brian, so this is something like "load the contents of FILE1 into a hash and then check whether the first word of each line in FILE2 was in the hash table"? in this case, this is different from the upgraded code of mine, which basically loads the contents of FILE2 into a hash and then read through FILE2 line by line.

Mike 2009-10-15 07:53:47

@brian, um.. I see the "split /\s+/, $_, 2;" line picks the first word of each line of the definition file. Ahh....I finally figured out all by myself what each line of your code does :) Thanks for sharing!

Mike 2009-10-15 08:02:57

I figured you'd suss it out on your own :)

brian d foy 2009-10-15 13:26:51

Answer 4

+3 A:

If you don't actually have to use Perl, (and you have cygwin or something else unixy installed), you can just do grep -f new_word.txt dic.txt. But let's assume you want to learn something about Perl here.. :)

use strict and use warnings are invaluable for spotting problems (and for teaching good habits). Remember that if you're unsure what a warning message means, you can look it up in perldoc perldiag.

Regarding your comment "Dunno why when I replace dic.txt with $dic in the death note, I'll receive "needs explicit package name" warning. Any ideas?" -- $dic is not a filename, but a file handle, and is not something you generally want to print out. To avoid using the filename twice (say, to make it easier to change later), define it at the top of the file, as I have done.

Using subroutines to advance the position in each file feels a little crude, but this algorithm only loops through each file once, and does not read either file into memory, so it will work even for huge input files. (This hinges on both files being sorted, which they appear to be in the example you provide.)

Code edited and fixed. I shouldn't have banged off a version just before bed and then not tested it (I blame the spouse) :D

use warnings;
use strict;

my $dictFile = 'dict.txt';
my $wordsFile = 'words.txt';
my $outFile = 'out.txt';

open my $dic, '<', $dictFile or die "Cannot open $dictFile: $!";
open my $filter, '<', $wordsFile or die "Cannot open $wordsFile: $!";
open my $learn, '>', $outFile or die "Cannot open $outFile: $!";

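# create variables before declaring subs, so the subs can share them as closures
my ($word, $key, $sep, $definition);
sub nextWord {
    $word = <$filter>;
    # stop at end of file; testing defined keeps a final "0" line from ending the run
    done() unless defined $word;
    chomp $word;
}
sub nextEntry {
    my $line = <$dic>;
    # stop at end of file, before split is handed an undefined value
    done() unless defined $line;
    # use parens around pattern to capture it into the list for later use
    ($key, $sep, $definition) = split(/(\s+)/, $line, 2);
}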
sub done
{
    close $filter or warn "can't close $wordsFile: $!";
    close $dic or warn "can't close $dictFile: $!";
    close $learn or warn "can't close $outFile: $!";
    exit;
}

nextWord();
nextEntry();

# now let's loop until we hit the end of one of the input files
for (;;)
{
    if ($word lt $key)
    {
        nextWord();
    }
    elsif ($word gt $key)
    {
        nextEntry();
    }
    else    # $word eq $key
    {
        # newline is still in definition; no need to append another
        print $learn ($key . $sep . $definition);
        nextWord();
        nextEntry();
    }
}

Ether 2009-10-14 06:27:52

@Ether, thanks! I'm saving the code for later study. BTW, I haven't learnt how to use subs yet. I've been trying to use what I've learnt from the solutions to my previous posts. I touched on subs in an early chapter of Learning Perl, but I've got to relearn them with more real-world examples like this one. Thanks again.

Mike 2009-10-14 06:55:43

@Ether: you probably want to work on your code some until it works and doesn't give errors. Hint: you will need a loop, and some way for the variables defined in your subs to be used outside them.

ysth 2009-10-14 08:24:03

My bad for posting code just before bed.. :)

Ether 2009-10-14 16:17:45

@Ether, thanks for upgrading the code :) um..seems there's something wrong. Nothing happens after running the code. I'm using this configuration: "my $dictFile = 'c:/FILE2.txt';my $wordsFile = 'c:/FILE1.txt';my $outFile = 'c:/FILE3.txt';"

Mike 2009-10-15 00:54:58

@Ether, it works!

Mike 2009-10-15 01:38:33

Now it's working great! Thanks :)

Mike 2009-10-15 01:40:54

@Ether, thanks, I've read the code from start to finish twice. It has a lot of subs but I think I'm able to completely understand each line of it. It's quite readable especially since now I've improved the stock of my Perl knowledge :) Thanks.

Mike 2009-10-15 13:12:32

Answer 5

+2 A:

It seems reasonable to me to assume that the number of words to look up will be small relative to the size of the dictionary. Therefore, you can read FILE1.txt into memory, putting each word into a hash.

Then, read the dictionary, outputting only the lines where the term is in the hash. I would also output to STDOUT which can then be redirected from the command line to any file you want.

#!/usr/bin/perl

use strict; use warnings;
use autodie qw(open close);

my ($words_file, $dict_file) = @ARGV;

my %words;
read_words(\%words, $words_file);

open my $dict_fh, '<', $dict_file;

while ( my $line = <$dict_fh> ) {
    # capturing match in list context returns captured matches
    if (my ($term) = ($line =~ /^(\w+)\s+\w/)) {
        print $line if exists $words{$term};
    }
}

close $dict_fh;

sub read_words {
    my ($words, $filename) = @_;

    open my $fh, '<', $filename;
    while ( <$fh> ) {
        last unless /^(\w+)/;
        $words->{$1} = undef;
    }
    close $fh;
    return;
}

Invocation:

C:\Temp> lookup.pl FILE1.txt FILE2.txt > FILE3.txt

Output:

C:\Temp> type FILE3.txt
azure         adj. bright blue, as of the sky
Byzantine     adj. of Byzantium or the E Roman Empire
cystitis      n. inflammation of the bladder
dyspeptic     adj. suffering from dyspepsia
eyrie         n. eagle's nest
fuzz          n. mass of soft light particle

Sinan Ünür 2009-10-14 10:28:00

@Sinan, thanks. I'll try to comment on the code after I finish the hashes section of Learning Perl.

Mike 2009-10-14 10:57:24

@Sinan, I'm not sure but what this line of code does: $words ->{$1} = undef. And also this line: my ($words, $filename) = @_;

Mike 2009-10-15 12:06:02

@Sinan, is "if (my ($term) = ($line =~ /^(\w+)\s+\w/))" short for "if ($line=~/^(\w+)\s+\w/) {my $term = $line; ...} "?

Mike 2009-10-15 12:13:02

@Mike `$words->{$1} = undef` sets the first captured match as an element of the hashref pointed to by `$words` with an undefined value. I do not care about the values in the hash, I just want to be able to quickly check if a word whose definition I have is in the list of definitions I want to print (i.e. is it one of the keys of `%$words`?)
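A small sketch of the difference between a key existing and its value being defined. The sample words are made up; `$words` is a reference to `%words`, mirroring how the hash is passed to `read_words`.

```perl
use strict;
use warnings;

my %words;
my $words = \%words;    # reference to the hash, as passed to read_words

# store keys only; the values are deliberately left undefined
$words->{$_} = undef for qw(azure eyrie);

# exists checks for the key itself, so an undef value does not matter
my $has_azure  = exists($words{azure})  ? 1 : 0;
my $has_banana = exists($words{banana}) ? 1 : 0;

# defined looks at the value, so it is false even though the key is present
my $azure_defined = defined($words{azure}) ? 1 : 0;

print "azure: $has_azure, banana: $has_banana, defined: $azure_defined\n";
```

This is why the lookup in the main loop uses `exists` rather than testing the value.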

Sinan Ünür 2009-10-15 12:16:39

@Mike Regarding: `my ($term) = ($line =~ /^(\w+)\s+\w/)`. A match with capture groups, evaluated in list context, returns the captured substrings. Therefore, it is a quick way to assign the captured word at the beginning of the line to `$term`.
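A short sketch of the same match in list and in scalar context, on a made-up dictionary line. The variable names `$matched`, `$none`, and `$entered` are introduced here for illustration.

```perl
use strict;
use warnings;

my $line = "azure         adj. bright blue, as of the sky\n";

# list context: the match returns the captured substrings
my ($term) = ( $line =~ /^(\w+)\s+\w/ );

# scalar context: the match returns only true or false
my $matched = ( $line =~ /^(\w+)\s+\w/ ) ? 1 : 0;

# a failed match returns an empty list, so the list assignment
# counts zero elements and the if-condition is false
my $entered = 0;
if ( my ($none) = ( "   \n" =~ /^(\w+)\s+\w/ ) ) {
    $entered = 1;
}

print "term=$term matched=$matched entered=$entered\n";
```

So `$term` receives the text captured by the parentheses, not the whole line.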

Sinan Ünür 2009-10-15 12:17:56

"$line =~ /^(\w+)\s+\w/" means: if any line of the dictionary file begins with one or more letters, followed by one or more spaces, and then at least one letter?

Mike 2009-10-15 12:20:00

Ah, yes, I see what $1 means

Mike 2009-10-15 12:20:40

$1 is sort of like the beginning word of each line of dictionary file

Mike 2009-10-15 12:21:38

This is clever :)

Mike 2009-10-15 12:22:31

Stupid me. I should've known this. $1 $2 are captured matches and can be written as \1 \2. And actually you had provided a comment

Mike 2009-10-15 12:24:31

So this line "last unless /^(\w+)/;" means read line by line but ignores any line that does not start with one or more letters.

Mike 2009-10-15 12:32:36

Yes I see....so my $term is not $line, but $1. This makes sense.

Mike 2009-10-15 12:35:11
