Hi, I have a problem writing a Perl program that matches the words in two documents. Let's say there are documents A and B.

I want to delete the words in document A that are not in document B.

Example 1:

A: I eat pizza

B: She go to the market and eat pizza

result: eat pizza

Example 2:

A: eat pizza

B: pizza eat

result: pizza (the word order is relevant, so "eat" is deleted.)

I use Perl for the system, and the number of sentences in each document is small, so I don't think I need SQL.

The program is a subprogram of an automatic essay grading system for the Indonesian language (Bahasa).

Thanks, and sorry if my question is a bit confusing. I'm really new to 'this world' :)

+1  A: 

OK, I'm without access at the moment, so this is not guaranteed to be 100% correct or even to compile, but it should provide enough guidance:

Solution 1: (word order does not matter)

#!/usr/bin/perl

use strict;
use warnings;
use File::Slurp;

# read_file() croaks on failure by default; calling it in list context
# (no "||" after it) returns one element per line
my @B_lines = File::Slurp::read_file("B");
my %B_words = ();
foreach my $line (@B_lines) {
    # split(' ', ...) splits on whitespace runs and skips leading whitespace
    $B_words{$_} = 1 foreach split(' ', $line);
}
my @A_lines = File::Slurp::read_file("A");
my @new_lines = ();
foreach my $line (@A_lines) {
    # keep only the words of this line that also occur somewhere in B
    my @B_words_only = grep { $B_words{$_} } split(' ', $line);
    push @new_lines, join(" ", @B_words_only) . "\n";
}
# write_file() also croaks on failure by default
File::Slurp::write_file("A_new", @new_lines);

This should create a new file "A_new" that contains only the words of A that are also in B.

This has a slight bug - it will replace any multiple-whitespace in file A with a single space, so

    word1        word2              word3

will become

word1 word2 word3

It can be fixed, but doing so would be really annoying, so I didn't bother. Let me know if you absolutely require the whitespace to be preserved 100% correctly.
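If the whitespace in A does have to survive untouched, one possible approach is to delete the unwanted words in place with a substitution instead of splitting and re-joining. This is only a sketch: the helper name filter_keep_whitespace and the in-memory strings are assumptions made for illustration, and in a real script the text would come from read_file().

```perl
use strict;
use warnings;

# Hypothetical helper: delete every word of $a_text that does not occur in
# $b_text, leaving all whitespace in $a_text exactly where it was.
sub filter_keep_whitespace {
    my ($a_text, $b_text) = @_;

    # Build a lookup of B's words (order in B does not matter here).
    my %in_b = map { $_ => 1 } split ' ', $b_text;

    # Replace each run of non-whitespace with itself if it is in B,
    # or with the empty string otherwise; whitespace is never matched.
    (my $filtered = $a_text) =~ s/(\S+)/$in_b{$1} ? $1 : ''/ge;
    return $filtered;
}

print filter_keep_whitespace("word1        word2              word3\n",
                             "word3 word1");
```

Note that a deleted word leaves the whitespace on both sides of it behind, so the gap where the word used to be gets wider rather than collapsing to a single space.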

Solution 2: (word order matters BUT you can print the words from file A without preserving whitespace at all)

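#!/usr/bin/perl

use strict;
use warnings;
use File::Slurp;

# scalar() makes read_file() return the whole file as one string;
# split(' ', ...) splits on whitespace runs and skips leading whitespace
my @A_words = split(' ', scalar(File::Slurp::read_file("A")));
my @B_words = split(' ', scalar(File::Slurp::read_file("B")));
my $B_counter = 0;    # B words before this index are already used up
foreach my $A_word (@A_words) {
    # look ahead in B for this word without moving $B_counter yet
    my $i = $B_counter;
    ++$i while $i < scalar(@B_words) && $B_words[$i] ne $A_word;
    next if $i == scalar(@B_words);    # not in the rest of B: drop the word
    $B_counter = $i + 1;               # matched: continue after it in B
    print "$A_word\n";                 # one word per line
}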
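Solution 2 walks through A and searches forward in B, which keeps "eat" for example 2 in the question, whereas the question expects "pizza". One reading of the two examples (an assumption, since the question does not spell out the rule) is that the roles are reversed: walk through B and search forward in A. The sketch below follows that reading; the helper name common_in_order and the inline strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: walk B's words in order and, for each one, look for it
# in A at or after the position following the previous match.
sub common_in_order {
    my ($a_text, $b_text) = @_;
    my @a_words = split ' ', $a_text;
    my @b_words = split ' ', $b_text;

    my @kept;
    my $a_pos = 0;    # A words before this index are already used up
    for my $b_word (@b_words) {
        for my $i ($a_pos .. $#a_words) {
            if ($a_words[$i] eq $b_word) {
                push @kept, $b_word;
                $a_pos = $i + 1;    # later matches must come after this one
                last;
            }
        }
    }
    return @kept;
}

# Example 1 from the question
print join(' ', common_in_order('I eat pizza',
                                'She go to the market and eat pizza')), "\n";
# Example 2 from the question
print join(' ', common_in_order('eat pizza', 'pizza eat')), "\n";
```

With the two examples from the question this prints "eat pizza" and then "pizza". Both directions are greedy, so neither is guaranteed to keep the largest possible set of in-order words.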

Solution 3 (why do we need Perl again? :) )

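You can do this trivially in the shell without Perl (or via a system() call or backticks in a parent Perl script), provided A and B each hold one word per line and are sorted, because comm compares sorted files line by line: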

comm -12 A B | tr "\012" " " 

To call this from Perl:

my $new_text = `comm -12 A B | tr "\012" " " `;

But see my last comment for why this may be considered "bad Perl", at least if you do this in a loop over very many files and care about performance.
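If spawning child processes is a concern, the same kind of comparison can be done inside Perl. The sketch below imitates what comm -12 does (one merge pass over two sorted lists, keeping the entries found in both). The helper name common_sorted_lines is hypothetical, and it uses Perl's plain string comparison, which may order words differently from comm under some locales.

```perl
use strict;
use warnings;

# Hypothetical helper: return the entries present in both sorted lists,
# in the spirit of "comm -12", using a single merge pass.
sub common_sorted_lines {
    my ($a_lines, $b_lines) = @_;    # array refs, each sorted with sort()
    my ($i, $j) = (0, 0);
    my @common;
    while ($i < @$a_lines && $j < @$b_lines) {
        my $cmp = $a_lines->[$i] cmp $b_lines->[$j];
        if    ($cmp < 0) { ++$i }    # entry only in A: skip it
        elsif ($cmp > 0) { ++$j }    # entry only in B: skip it
        else             { push @common, $a_lines->[$i]; ++$i; ++$j }
    }
    return @common;
}

my @a = sort qw(I eat pizza);
my @b = sort qw(She go to the market and eat pizza);
print join(' ', common_sorted_lines(\@a, \@b)), "\n";
```

Like the shell version, this reports the common words in sorted order, not in the order they appear in A.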

DVK
OK, I just saw your second example and will try to fix for that... it's a bit more complicated this way if the word order matters
DVK
Haha, sorry for the edit. It's a bit confusing since it's my first time using Perl, but a big thanks for the reply :)
Randy
@Randy - please see my question in the comment. Do you really care about how the common words are output?
DVK
No. If the question is about lines, I've made it so each document has just one line.
Randy
Um, the common words (stopwords) in the sentence have already been removed, so only the important words are left.
Randy
@Randy - I mean, if the answer will be 1 word per line, is that OK?
DVK
@DVK: yes..it's ok.
Randy
@Randy - OK, see my solution #2 for the Perl version and #3 for the shell command... the latter is a lot more concise but is not Good Perl Practice, as it will spawn off 2 separate child processes, which is bad for performance if it happens many times in a loop.
DVK
@DVK, you code really fast. I still have a long way to go :) Anyway, I'll implement it in my program and report back soon. Thank you.
Randy
@Randy - artifact of having competed in programming contests... but in this case I'd say hold the compliments till you verify that the code actually works, since I couldn't test it :)
DVK
@DVK it works, thanks! I modified it a little so I can use it with Matlab :) Thanks again.
Randy
@Randy - you're welcome. Feel free to indicate whether the answer was helpful by the standard Stack Overflow methods: (1) up-voting the answer (up-arrow next to it) and (2) "accepting" it (checkmark next to the answer). Cheers, and welcome to the wonderful world of Perl, where possible things are easy and impossible things are doable :)
DVK