Edit: My original title has been changed somewhat, and I suspect the current title hides my original purpose: letting Perl automatically use the contents of one file as search keywords, search another file with them, and write the matches to a third file. Without this kind of automation, I would have to type the query terms listed in FILE1 one by one and fetch the matches from FILE2 one at a time with something like "while(<FILE2>){if (/query terms/){print FILE3 $_}}".

To be more specific, FILE1 should look something like this:

azure
Byzantine
cystitis
dyspeptic
eyrie
fuzz

FILE2 might (or might not) look something like this:

azalea        n.  flowering shrub of the rhododendron family
azure         adj. bright blue, as of the sky 
byte          n. fixed number of binary digits, often representing a single character
Byzantine     adj. of Byzantium or the E Roman Empire
cystitis      n. inflammation of the bladder
Czech         adj. of the Czech Republic or Bohemia
dyslexic      adj. suffering from dyslexia
dyspeptic     adj. suffering from dyspepsia
eyelet        n. small hole in cloth, in a sail, etc for a rope, etc to go through; 
eyrie         n. eagle's nest
fuzz          n. mass of soft light particle
fuzzy         adj. like fuzz

FILE3 should look something like this if FILE2 looks like the above:

azure         adj. bright blue, as of the sky 
Byzantine     adj. of Byzantium or the E Roman Empire
cystitis      n. inflammation of the bladder
dyspeptic     adj. suffering from dyspepsia
eyrie         n. eagle's nest
fuzz          n. mass of soft light particle

It took me hours of trial and error to arrive at a solution that seems to work, but my code is probably buggy, not to mention inefficient. I hope you can put me on the right track if I'm wrong, offer some guidance, and share any different approaches to the problem (there must be some). As suggested by daotoad, I've tried to comment what each line of code does. Please correct me if I misunderstand something. Thanks :)

#!perl  #for Windows, simply perl suffices. I'm reading *Learning Perl*.    
use warnings; #very annoying I've always been receiving floods of error messages
use strict;   #I often have to look here and there because of my carelessness

open my $dic,'<', 'c:/FILE2.txt' or die "Cannot open dic.txt ;$!"; # 3-argument version of open statement helps avoid possible confusion; Dunno why when I replace dic.txt with $dic in the death note, I'll receive "needs explicit package name" warning. Any ideas?
open my $filter,'<','c:/FILE1.txt' or die "Cannot open new_word.txt :$!"; 
my @filter=<$filter>; #store the entire contents of FILE1 into @filter.
close $filter;        #FILE1 is useless so close the connection between FILE1 and perl
open my $learn,'>','c:/FILE3.txt'; #This file is where I output matching lines.
my $candidate="";     #initialize the candidate to empty string. It will be used to store matching lines. Learnt this from Jeff.

while(<$dic>){    #let perl read the contents of FILE2 line by line.
for (my $n=0; $n<=$#filter; $n++){ #let perl go through each line of FILE1 too
my $entry = $filter[$n];
chomp($entry);   #Figured out this line must be added after many fruitless attempts
if (/^$entry\s/){  #let perl compare each line of FILE2 with any line of FILE1.
$candidate.= $_ ; } #every time a match is found, store the line into $candidate
}
}
print $learn $candidate; #output the results to FILE3

UPGRADE1:

Thank you very much for the guidance! I truly appreciate it :)

I believe I'm now going in a somewhat different direction from the one I originally intended. The concept of hashes was beyond my Perl knowledge at the time. Having finished the hashes section of Learning Perl, I'm now thinking: although hashes may efficiently solve the example problem I posted above, things might get complicated if the headwords (not the whole entries) in the definition file (FILE2) have duplicates. On the other hand, I see that hashes are very important in Perl programming. So this morning I tried to implement @mobrule's idea: load the contents of FILE1 into a hash and then check whether the first word of each line in FILE2 is in the hash table. But then I decided I should load FILE2 into a hash instead of FILE1, because FILE2 contains dictionary entries and it is meaningful to treat HEADWORDS as KEYS and DEFINITIONS as VALUES. I came up with the following code. It seems close to success.

#!perl

open my $learn,'>','c:/file3.txt' or die "Cannot open Study Note;$!";
open my $dic,"<",'c:/file2.txt' or die "Cannot open Dictionary: $!";
my %hash = map {split/\t+/} <$dic>; # I did some googling on how to load a file into a hash and found this works. But actually I don't quite understand why. I figured the pattern out by myself. /\t+/ seems to be working because the headwords and the main entries in FILE2 are separated by tabs.

open my $filter,'<','c:/file1.txt' or die "Cannot open Glossary: $!";
while($line=<$filter>){
chomp ($line);
if (exists $hash{$line}){
print "$learn $hash{$line}"; # this line is buggy. first it won't output to FILE3. second, it only prints the values of the hash but I want to include the keys.
}
}

The code output the following results on screen:

GLOB(0x285ef8) adj. bright blue, as of the sky
GLOB(0x285ef8) adj. of Byzantium or the E Roman Empire
GLOB(0x285ef8) n. inflammation of the bladder
GLOB(0x285ef8) adj. suffering from dyspepsia
GLOB(0x285ef8) n. eagle's nest
GLOB(0x285ef8) n. mass of soft light particle

UPGRADE2

One problem solved. I can print both keys and values now by doing a minor modification of the last line.

print "$learn $line: $hash{$line}";

UPGRADE3

Haha: I made it! I made it :) modified the code again and now it outputs stuff to FILE3!

#!perl

open my $learn,'>','c:/file3.txt' or die $!;
open my $dic,"<",'c:/file2.txt' or die $!;
my %hash = map {split/\t+/} <$dic>; #the /\t+/ pattern works because the entries in my FILE2 are separated into the headwords and the definition by two tab spaces. 

open my $filter,'<','c:/file1.txt' or die $!;
while($line=<$filter>){
chomp ($line);
if (exists $hash{$line}){
print $learn "$line: $hash{$line}";
}
}

UPGRADE4

I'm wondering: if FILE2 had totally different contents, say, sentences that contain the query words in FILE1, it would be difficult, if not impossible, to use the hash approach, right?
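
For that sentence case, one possible direction (only a sketch, not something taken from the answers) is to join the FILE1 words into a single regex and test each line against it once. The helper names build_word_regex and matching_lines are made up for illustration, the sample sentences are invented, and the sketch assumes at least one query word and that every query word starts and ends with a letter or digit, so that \b behaves as a word boundary.

```perl
use strict;
use warnings;

# Hypothetical helper: build one regex that matches any of the given words
# as a whole word. quotemeta escapes characters that are special in a regex.
# Assumes the list is not empty.
sub build_word_regex {
    my @words = @_;
    my $alternation = join '|', map { quotemeta($_) } @words;
    return qr/\b(?:$alternation)\b/;
}

# Hypothetical helper: return the lines that contain at least one of the words.
sub matching_lines {
    my ($regex, @lines) = @_;
    return grep { $_ =~ $regex } @lines;
}

# Invented sample data standing in for FILE1 and a sentence-style FILE2
my $word_regex = build_word_regex(qw(azure fuzz eyrie));
my @sentences  = (
    "The azure sky was cloudless.\n",
    "Her sweater felt fuzzy.\n",            # "fuzzy" is not the whole word "fuzz"
    "An eagle circled above its eyrie.\n",
);
print matching_lines($word_regex, @sentences);
```

The hash lookup only works when the key can be cut out of the line at a known position (the first column); a regex like this trades that speed for the ability to find a word anywhere in the line.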

UPGRADE5

Having carefully read the perlfunc page about the split operator, now I know how to improve my code :)

#!perl

    open my $learn,'>','c:/file3.txt' or die $!;
    open my $dic,"<",'c:/file2.txt' or die $!;
    my %hash = map {split/\s+/,$_,2} <$dic>; # sets the limit of separate fields to 2
    open my $filter,'<','c:/file1.txt' or die $!;
    while($line=<$filter>){
    chomp ($line);
    if (exists $hash{$line}){
    print $learn "$line: $hash{$line}";
    }
    }
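
To see what the limit argument changes, here is a small self-contained comparison on one sample entry from FILE2. It is only an illustration; the variable names are my own.

```perl
use strict;
use warnings;

my $entry = "azure         adj. bright blue, as of the sky\n";

# Without a limit, every run of whitespace separates fields,
# so the definition is chopped into individual words
my @all_fields = split /\s+/, $entry;

# With a limit of 2, splitting stops after the first separator:
# the headword comes first, then the rest of the line untouched
my ($headword, $definition) = split /\s+/, $entry, 2;

print scalar(@all_fields), " fields without a limit\n";
print "headword:   $headword\n";
print "definition: $definition";   # still ends with the original newline
```

Note that loading the pairs into a hash keeps only one definition per headword: if a headword appears twice in FILE2, the later entry overwrites the earlier one.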
+1  A: 

Are FILE1 and FILE2 initially sorted? If so, you only need a single loop, not a nested one:

use 5.010;
use warnings;
use strict;

my $dictFile = 'c:/FILE2.txt';
my $wordsFile = 'c:/FILE1.txt';
my $outFile = 'c:/FILE3.txt';

open my $dic, '<', $dictFile or die "Cannot open $dictFile: $!";
open my $filter, '<', $wordsFile or die "Cannot open $wordsFile: $!";
open my $learn, '>', $outFile or die "Cannot open $outFile: $!";

my $dic_line;
my $dic_word;
my $filter_word;

# loop forever (or until last'ing out of the loop, anyway)
while (1) {
    # if we don't have a word from the filter list, get one
    if ( ! defined $filter_word ) {
        # get a line from the filter file, bailing out of the loop if at the end
        $filter_word = <$filter> // last;
        # remove the newline so we can string compare
        chomp($filter_word);
    }
    # if we don't have a word from the dictionary, get one
    if ( ! defined $dic_line ) {
        # get a line from the dictionary, bailing out of the loop if at the end
        $dic_line = <$dic> // last;
        # get the first word on the line
        ($dic_word) = split ' ', $dic_line;
    }
    # if we have a match, print it
    if ( $dic_word eq $filter_word ) { print $learn $dic_line }
    # only keep considering this dictionary line if it is beyond the filter word we had
    if ( lc $dic_word le lc $filter_word ) { undef $dic_line }
    # only keep considering this filter word if it is beyond the dictionary line we had
    if ( lc $dic_word ge lc $filter_word ) { undef $filter_word }
}
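
The // used above is the defined-or operator, which is available from Perl 5.10 onwards. It returns its left side unless that is undef, which is exactly what reading from a filehandle gives at end of file. A small sketch contrasting it with ||; the two helper subs are hypothetical and exist only for the comparison:

```perl
use strict;
use warnings;

# Hypothetical helpers to contrast the two operators
sub with_defined_or { my ($value) = @_; return $value // 'fallback'; }
sub with_logical_or { my ($value) = @_; return $value || 'fallback'; }

for my $value (undef, 0, '', 'azure') {
    my $shown = defined $value ? "'$value'" : 'undef';
    # // only falls back for undef; || also falls back for 0 and ''
    printf "%-8s  // gives '%s'   || gives '%s'\n",
        $shown, with_defined_or($value), with_logical_or($value);
}
```

So in the loop, a line consisting of just "0" would still be processed, while a genuine end of file triggers last.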
ysth
@ysth, yes, in my example, they are initially sorted. I invented the problem so that I could improve my Perl knowledge. But I don't know how to avoid the nested loop. Can I?
Mike
@ysth, thanks. But running the code gives me "uninitialized in concatenation" error.
Mike
@Mike: on what line? I left out the setup stuff; will add that in case you did something differently.
ysth
@Mike: I have the feeling you have mistyped something; there is no concatenation there.
ysth
@ysth, I had added the setup stuff before I ran the code. But I think I'm now beginning to see where the problem is:) It's probably not that I mistyped something. Running the current code is giving me the same "uninitialized value in concatenation or string at Line 16" error. The version of my activeperl is 5.8.8. Gotta upgrade it to 5.010.
Mike
@ysth, I've now upgraded ActivePerl to 5.10.1. The code does its job perfectly :) It's neat code, much better than my first attempt. But could you please add comments starting from the while(1) line, if possible? I think I need some explanation to understand the code. Some things, such as the // symbol, are new to me.
Mike
@ysth, I'm reading the comments. Thanks a lot!!
Mike
Thanks. It's great to have the comments.
Mike
+3  A: 

Have you gotten to the part of Learning Perl where you learn about hashes? You could load the contents of FILE1 into a hash and then check whether the first word of each line in FILE2 was in your hash table.

mobrule
This seems like the most natural solution to me. Replace the bit where you slurp into @filter with a block that reads each line of f1, setting $filter{$_} = 1, or somesuch, then later check candidates against that hash, like if($filter{$candidate} == 1) { print LEARN $candidate; }.
fatcat1111
@mobrule, thanks for the pointer. This approach seems much better. Actually I've already roughly browsed through some pages about hashes, but don't quite grasp the concept yet. I'll try a little harder.
Mike
+8  A: 

You're making the problem harder than it needs to be by thinking about all of it at once rather than breaking it down into manageable bits.

It doesn't look like you need regexes here. You just need to see if the term in the first column was in the list:

open my($patterns), '<', 'patterns.txt' or die "Could not get patterns: $!"; 

my %hash = map { my $p = $_; chomp $p; $p, 1 } <$patterns>;

open my($lines), '<', 'file.txt' or die "Could not open file.txt: $!";

while ( <$lines> ) {
 my( $term ) = split /\s+/, $_, 2;
 print if exists $hash{$term};
 }

If you really needed regular expressions to find the terms, you might be able to get away with just grep:

 grep -f patterns.txt file.txt
brian d foy
Well, I guess this is something about hashes, which I don't quite grasp yet. But thanks a lot, brian! I also want to thank you again for the solution to my previous post :) I'm saving the code for later study. Gotta read the hashes section in Learning Perl carefully.
Mike
@Mike, if there's a chance that the item being looked for isn't in the file of definitions, you can delete the items from the hash as they're found, and then report on what's still left in the hash after the search.
Joe
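
A minimal sketch of the idea in the comment above, assuming the word list and dictionary lines are already in memory. The sub name lookup_and_report and the sample data are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a reference to the matching lines and a
# reference to the sorted list of words that were never found.
sub lookup_and_report {
    my ($words, $dict_lines) = @_;
    my %wanted = map { $_ => 1 } @$words;
    my @found;
    for my $line (@$dict_lines) {
        my ($term) = split ' ', $line;   # first whitespace-separated word
        next unless defined $term;       # skip blank lines
        # delete returns the removed value, so this is true only the
        # first time a wanted word is seen
        push @found, $line if delete $wanted{$term};
    }
    return (\@found, [ sort keys %wanted ]);
}

# Invented sample data: "zebra" has no entry in the dictionary lines
my ($found, $missing) = lookup_and_report(
    [ qw(azure eyrie zebra) ],
    [ "azure  adj. bright blue\n", "eyrie  n. eagle's nest\n" ],
);
print @$found;
print "Not in the dictionary: @$missing\n";
```

One consequence of deleting as you go: if a headword occurs twice in the dictionary, only its first line is kept.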
@brian, I'm now trying to understand your code, since I have now learnt something about hashes. Your code does the job like magic, but there is a lot in it I don't follow. I was wondering if you could explain each line. I've also upgraded my code, but I've found that the pattern I used in the map function is crippled: it won't work with a different definition file.
Mike
Having taken another look at your code, I think it is very much like the upgraded code of mine. I don't quite understand the functions of "my $p = $_; chomp $p; $p, 1" and of "split /\s+/, $_, 2;".
Mike
@brian, I figured out by myself what "map { my $p = $_; chomp $p; $p, 1 } <$patterns>;" does. This line changes each line of "patterns.txt" to look something like word1 => 1 word2=>1 word3=>1----
Mike
@brian, so this is something like "load the contents of FILE1 into a hash and then check whether the first word of each line in FILE2 was in the hash table"? in this case, this is different from the upgraded code of mine, which basically loads the contents of FILE2 into a hash and then read through FILE2 line by line.
Mike
@brian, um.. I see the "split /\s+/, $_, 2;" line picks the first word of each line of the definition file. Ahh....I finally figured out all by myself what each line of your code does :) Thanks for sharing!
Mike
I figured you'd suss it out on your own :)
brian d foy
+3  A: 

If you don't actually have to use Perl, (and you have cygwin or something else unixy installed), you can just do grep -f new_word.txt dic.txt. But let's assume you want to learn something about Perl here.. :)

use strict and use warnings are invaluable for spotting problems (and for teaching good habits). Remember that if you're unsure what a warning message means, you can look it up in perldoc perldiag.

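Regarding your comment "Dunno why when I replace dic.txt with $dic in the death note, I'll receive "needs explicit package name" warning. Any ideas?" -- $dic is not a filename, but a file handle, and is not something you generally want to print out. The message itself appears because a variable declared with my is not visible until the statement after the one that declares it, so under use strict the $dic inside the die string of that same open statement counts as undeclared. To avoid using the filename twice (say, to make it easier to change later), define it at the top of the file, as I have done.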

Using subroutines to advance the position in each file feels a little crude, but this algorithm only loops through each file once, and does not read either file into memory, so it will work even for huge input files. (This hinges on both files being sorted, which they appear to be in the example you provide.)

Code edited and fixed. I shouldn't have banged off a version just before bed and then not tested it (I blame the spouse) :D

use warnings;
use strict;

my $dictFile = 'dict.txt';
my $wordsFile = 'words.txt';
my $outFile = 'out.txt';

open my $dic, '<', $dictFile or die "Cannot open $dictFile: $!";
open my $filter, '<', $wordsFile or die "Cannot open $wordsFile: $!";
open my $learn, '>', $outFile or die "Cannot open $outFile: $!";

# create variables before declaring subs, which creates closures
my ($word, $key, $sep, $definition);
sub nextWord {
    $word = <$filter>;
    done() unless $word;
    chomp $word;
};
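sub nextEntry {
    # read the line first so end of file is detected without a warning from split
    my $line = <$dic>;
    done() unless defined $line;
    # use parens around pattern to capture it into the list for later use
    ($key, $sep, $definition) = split(/(\s+)/, $line, 2);
    done() unless $key;
}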
sub done
{
    close $filter or warn "can't close $wordsFile: $!";
    close $dic or warn "can't close $dictFile: $!";
    close $learn or warn "can't close $outFile: $!";
    exit;
}

nextWord();
nextEntry();

# now let's loop until we hit the end of one of the input files
for (;;)
{
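    # compare case-insensitively: the sample files are sorted ignoring case
    # ("byte" comes before "Byzantine"), which a plain lt/gt would not respect
    if (lc $word lt lc $key)
    {
        nextWord();
    }
    elsif (lc $word gt lc $key)
    {
        nextEntry();
    }
    else    # $word equals $key, ignoring case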
    {
        # newline is still in definition; no need to append another
        print $learn ($key . $sep . $definition);
        nextWord();
        nextEntry();
    }
}
Ether
@Ether, thanks! I'm saving the code for later study. BTW, I haven't learnt how to use subs yet. I've been trying to use what I've learnt from the solutions to my previous posts. An early chapter of Learning Perl touched on subs, but I need to relearn them with more real-world examples like this one. Thanks again.
Mike
@Ether: you probably want to work on your code some until it works and doesn't give errors. Hint: you will need a loop, and some way for the variables defined in your subs to be used outside them.
ysth
My bad for posting code just before bed.. :)
Ether
@Ether, thanks for upgrading the code :) um..seems there's something wrong. Nothing happens after running the code. I'm using this configuration: "my $dictFile = 'c:/FILE2.txt';my $wordsFile = 'c:/FILE1.txt';my $outFile = 'c:/FILE3.txt';"
Mike
@Ether, it works!
Mike
Now it's working great! Thanks :)
Mike
@Ether, thanks, I've read the code from start to finish twice. It has a lot of subs but I think I'm able to completely understand each line of it. It's quite readable especially since now I've improved the stock of my Perl knowledge :) Thanks.
Mike
+2  A: 

It seems reasonable to me to assume that the number of words to look up will be small relative to the size of the dictionary. Therefore, you can read FILE1.txt into memory, putting each word into a hash.

Then, read the dictionary, outputting only the lines where the term is in the hash. I would also output to STDOUT which can then be redirected from the command line to any file you want.

#!/usr/bin/perl

use strict; use warnings;
use autodie qw(open close);

my ($words_file, $dict_file) = @ARGV;

my %words;
read_words(\%words, $words_file);

open my $dict_fh, '<', $dict_file;

while ( my $line = <$dict_fh> ) {
    # capturing match in list context returns captured matches
    if (my ($term) = ($line =~ /^(\w+)\s+\w/)) {
        print $line if exists $words{$term};
    }
}

close $dict_fh;

sub read_words {
    my ($words, $filename) = @_;

    open my $fh, '<', $filename;
    while ( <$fh> ) {
        last unless /^(\w+)/;
        $words->{$1} = undef;
    }
    close $fh;
    return;
}

Invocation:

C:\Temp> lookup.pl FILE1.txt FILE2.txt > FILE3.txt

Output:

C:\Temp> type FILE3.txt
azure         adj. bright blue, as of the sky
Byzantine     adj. of Byzantium or the E Roman Empire
cystitis      n. inflammation of the bladder
dyspeptic     adj. suffering from dyspepsia
eyrie         n. eagle's nest
fuzz          n. mass of soft light particle
Sinan Ünür
@Sinan, thanks. I'll try to comment on the code after I finish the hashes section of Learning Perl.
Mike
@Sinan, I'm not sure what this line of code does: $words ->{$1} = undef. And also this line: my ($words, $filename) = @_;
Mike
@Sinan, is "if (my ($term) = ($line =~ /^(\w+)\s+\w/))" short for "if ($line=~/^(\w+)\s+\w/) {my $term = $line; ...} "?
Mike
@Mike `$words->{$1} = undef` sets the first captured match as an element of the hashref pointed to by `$words` with an undefined value. I do not care about the values in the hash, I just want to be able to quickly check if a word whose definition I have is in the list of definitions I want to print (i.e. is it one of the keys of `%$words`?)
Sinan Ünür
@Mike Regarding: `my ($term) = ($line =~ /^(\w+)\s+\w/)`. A match with captures, evaluated in list context, returns the captured substrings. Therefore, it is a quick way to assign the captured word at the beginning of the line to `$term`.
Sinan Ünür
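
A short illustration of the difference between the two contexts, using one sample dictionary line. The variable names are chosen for the example only.

```perl
use strict;
use warnings;

my $line = "eyrie         n. eagle's nest\n";

# List context (note the parentheses around $term): the match returns
# the list of captured substrings, so $term receives what (\w+) matched
my ($term) = $line =~ /^(\w+)\s+\w/;

# Scalar context: the match only reports success (1) or failure
my $matched = $line =~ /^(\w+)\s+\w/;

print "term:    $term\n";
print "matched: $matched\n";

# When the match fails, the list is empty, so the list assignment used
# as a condition is false and the block is skipped
if (my ($none) = "   \n" =~ /^(\w+)\s+\w/) {
    print "unexpected match: $none\n";
}
else {
    print "no match\n";
}
```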
$line =~ /^(\w+)\s+\w/)) means if any line of dictionary file begins with one or more letters followed by one or more spaces and then at least one letter?
Mike
Ah, yes, I see what $1 means
Mike
$1 is sort of like the beginning word of each line of dictionary file
Mike
This is clever :)
Mike
Stupid me. I should've known this. $1 $2 are captured matches (inside the pattern itself they are written \1 \2). And actually you had provided a comment
Mike
So this line "last unless /^(\w+)/;" means read line by line, but stop reading at the first line that does not start with one or more word characters.
Mike
Yes I see....so my $term is not $line, but $1. This makes sense.
Mike