ansaurus

Question

Searching one file but displaying relevant content from another using Perl

Answer 1

+2 A:

(This is an answer to part 1 of the question only)

I've actually made a working "translated text search". I just used percentage offsets into the file. This worked for short texts but quickly broke down once the text was of any real length.

my $offset = $offset_of_passage_in_text1 * length ($text2)/length ($text1);

The margin of error grows as the text gets longer. For a whole book, I don't think that approach has much hope.
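As a minimal sketch of the percentage-offset idea, the snippet below scales an offset found in one text to the proportional position in the other. The helper name proportional_offset and the two sample strings are made up for illustration; they are not from the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: map a character offset in $text1 to the
# proportionally equivalent offset in $text2.
sub proportional_offset {
    my ($offset1, $text1, $text2) = @_;
    return 0 unless length $text1;    # avoid dividing by zero
    return int($offset1 * length($text2) / length($text1));
}

# Made-up sample texts, for illustration only.
my $text1 = "Bonjour le monde. Elle est ici.";
my $text2 = "Hello world. She is here.";

my $offset1 = index($text1, "Elle");
my $offset2 = proportional_offset($offset1, $text1, $text2);

# Show a window of the second text starting at the estimated position.
print substr($text2, $offset2, 12), "\n";
```

The estimate is only as good as the assumption that both texts grow at the same rate, which is why it drifts on longer texts.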

One suggestion is to send the second language text to Google translate or just bung it through some kind of s/(\w+)/$dictionary{$1}/ substitution, then search for key words in the translated text to locate the likely position of the translation.

Here is a rough sketch of the code to make this work:

# Load a two-column, whitespace-separated word list into a hash.
open my $dictionary_file, "<:utf8", "name_of_file_containing_English_and_Chinese"
    or die $!;
my %dictionary;
while (<$dictionary_file>) {
    my ($english, $chinese) = split;
    $dictionary{$english} = $chinese;
}
close $dictionary_file or die $!;
# Replace every word with its dictionary entry.
my $crude_translation = $english_text;
$crude_translation =~ s/(\w+)/$dictionary{$1}/g;

I haven't tested this. The last line doesn't attempt to catch errors caused by words which are not in the dictionary.
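One possible way to handle words that are missing from the dictionary is to leave them untouched. The sketch below assumes that behaviour is acceptable; the sub name crude_translate and the tiny in-memory dictionary are hypothetical stand-ins for the file-based one above.

```perl
use strict;
use warnings;

# Hypothetical in-memory dictionary standing in for the file-based one.
my %dictionary = (
    cat => 'chat',
    dog => 'chien',
);

sub crude_translate {
    my ($text, $dict) = @_;
    # The /e flag evaluates the replacement as code. exists() keeps
    # unknown words as they are instead of substituting undef, which
    # would also trigger an "uninitialized value" warning.
    $text =~ s/(\w+)/exists $dict->{$1} ? $dict->{$1} : $1/ge;
    return $text;
}

print crude_translate("the cat and the dog", \%dictionary), "\n";
```

Keeping unknown words in place means the output stays aligned word for word with the input, which helps when searching for key words afterwards.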

Kinopiko 2009-10-17 04:52:16

Thanks Kinopiko. I'm totally clueless as to how to use percentage offsets in Perl. After all, I'm a beginner. But I see your point, and I'll try to improve my Perl knowledge. I also thought of using some kind of keyword translation first and then searching for the possible corresponding sentence. But again, at this point it seems quite beyond my ability.

Mike 2009-10-17 05:12:22

I've added some example code.

Kinopiko 2009-10-17 05:25:23

@Kinopiko, thanks a lot for the example code. This shows me the direction :)

Mike 2009-10-17 06:00:35

Answer 2

+2 A:

To avoid the warning, you have to check whether or not $n is defined():

if(defined $n) {
  open my $eng,'<',$file2 or die "Cannot open $file2: $!";
  { local $/="\n\n";
    <$eng> while --$n;
    print scalar <$eng>;
    close $eng;
  }
} else {
  print "No match found!\n";
}

I also rewrote the part that reads English. Rather than reading the entire file in and only using one line of it, it reads in $n - 1 lines and throws them away, and then prints the next line it reads (for real this time). This should have the same effect, but with a lower memory impact on large files. (If it doesn't, it's probably an off-by-one error because I'm tired.)
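The skip-and-print idea can be sketched as a small sub. This is only an illustration under stated assumptions: the name nth_paragraph is made up, $n is assumed to count from 1 (a line number, not an array index), and the sample text is read through an in-memory filehandle instead of a real file.

```perl
use strict;
use warnings;

# Return paragraph number $n (counting from 1) from filehandle $fh,
# or undef if $n is invalid or the file has fewer than $n paragraphs.
sub nth_paragraph {
    my ($fh, $n) = @_;
    return if !defined $n || $n < 1;
    local $/ = "\n\n";              # read paragraph by paragraph
    while (--$n) {                  # discard the first $n - 1 paragraphs
        my $skipped = <$fh>;
        return unless defined $skipped;
    }
    my $para = <$fh>;
    return $para;
}

# Made-up sample data read through an in-memory filehandle.
my $english = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.\n";
open my $eng, '<', \$english or die "Cannot open in-memory file: $!";
print nth_paragraph($eng, 2);
close $eng;
```

Rejecting values below 1 matters here: a 0-based index of 0 would be decremented to -1, and the skip loop would then never reach zero.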

EDIT: It turns out this introduced a subtle bug. Your code to find the matching line does the same thing: slurps the file into an array, then finds the array index that matches. Let's convert this code to read line-by-line so that we don't get huge memory consumption issues:

open my $fr,'<', $file1 or die "Cannot open $file1: $!";
{ local $/="\n\n";
  while(<$fr>) {
    $n = $. if /$query/i;
  }
}

I think you understand most of that: while(<$fr>) reads line-by-line from $fr and assigns each line to $_ for the loop iteration, and /$query/i implicitly matches against $_ (which is what we want). But you're probably curious about this little bugger: $n = $. Here is what perldoc perlvar says about it:

HANDLE->input_line_number(EXPR)

$INPUT_LINE_NUMBER

$NR

$.

Current line number for the last filehandle accessed.

Each filehandle in Perl counts the number of lines that have been read from it. (Depending on the value of $/ , Perl's idea of what constitutes a line may not match yours.) When a line is read from a filehandle (via readline() or <> ), or when tell() or seek() is called on it, $. becomes an alias to the line counter for that filehandle.

You can adjust the counter by assigning to $. , but this will not actually move the seek pointer. Localizing $. will not localize the filehandle's line count. Instead, it will localize perl's notion of which filehandle $. is currently aliased to.

$. is reset when the filehandle is closed, but not when an open filehandle is reopened without an intervening close(). For more details, see "I/O Operators" in perlop. Because <> never does an explicit close, line numbers increase across ARGV files (but see examples in eof).

You can also use HANDLE->input_line_number(EXPR) to access the line counter for a given filehandle without having to worry about which handle you last accessed.

(Mnemonic: many programs use "." to mean the current line number.)

So if we found a match in your third paragraph, $. would be 3. As a general recommendation, read through the perlvar page every once in a while. There are some gems in there, and even if you don't understand what everything is for, you'll get it on a reread.
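To see $. counting paragraphs rather than lines, here is a small self-contained sketch. The sub name last_matching_paragraph and the sample French text are made up, and the query is wrapped in \Q...\E on the assumption that it should be matched literally rather than as a regex.

```perl
use strict;
use warnings;

# Return the 1-based number of the last paragraph in $text that
# matches $query, or undef if no paragraph matches.
sub last_matching_paragraph {
    my ($text, $query) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "\n\n";              # one "line" is now one paragraph
    my $n;
    while (<$fh>) {
        $n = $. if /\Q$query\E/i;   # $. is the current record number
    }
    close $fh;
    return $n;
}

# Made-up sample text with three paragraphs.
my $french = "La petite ville.\n\nSes maisons blanches.\n\nElle est ici.\n";
my $n = last_matching_paragraph($french, 'elle est');
print defined $n ? "Match in paragraph $n\n" : "No match found!\n";
```

Because the loop keeps overwriting $n, a query that matches several paragraphs yields the last one, just like the loop above.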

However, the final thing I have to say is that mobrule's advice about explicitly storing paragraph information is probably the best way to go. I might shy away from a homemade format, but I understand if XML or something is a little too heavyweight for your purposes. (Just know that your purposes are likely to expand greatly if you're not careful.)

Chris Lutz 2009-10-17 04:54:40

`print <$eng>` evaluates `<$eng>` in list context. Try `print scalar <$eng>`

mobrule 2009-10-17 05:04:02

Crap! Brains not working tonight.

Chris Lutz 2009-10-17 05:06:55

Chris, thanks for the fix :)

Mike 2009-10-17 05:13:08

Now the technical warning is gone :)

Mike 2009-10-17 05:18:13

@mobrule, thanks for the suggestion :)

Mike 2009-10-17 05:27:38

Now I know how to get rid of the unfriendly warning message, but I have trouble making the improved code work the way it should. I don't quite understand what the following line does: <$eng> while --$n

Mike 2009-10-17 05:36:46

@Chris, these three lines "<$eng> while --$n; print scalar <$eng>; close $eng;" are not working. For example, using my example files posted above, when I search for "elle est", the output on screen is "Verrieres is sheltered ... in Verrieres", which is not right.

Mike 2009-10-17 11:47:40

@Mike - What it does is create a loop. `while --$n` will subtract 1 from `$n` and then, if `$n` is not zero, will execute the loop body. The loop body is `<$eng>` which reads a line from the file and throws it away. The idea is that if `$n` is the line number you want read, we read `$n - 1` lines from the file, so that the _next_ line that we read is the line we really want. The problem is in the indexing code. The indexing code assigns an _array index_ to `$n` and this code treats it as a _line number_ (which most people consider as starting at 1). Updating answer.

Chris Lutz 2009-10-17 19:53:59

@Chris, thanks for the clarifications and the fix. Yes, I see it was the indexing number that caused the problem.

Mike 2009-10-18 05:48:16

Answer 3

+2 A:

mobrule 2009-10-17 05:21:26

@mobrule, thanks, I've just tested your fix to modify the warning message. It works fine except for one thing. When I applied Chris' fix, the warning message became exactly "Oops, no match is found!" but this code gives me something else. For example, using the example files I posted, if I search for "admin", I receive the warning message "Oops, no match is found. Chapter 1".

Mike 2009-10-17 11:53:54

Answer 4

+1 A:

Here's a different approach for you to consider:

use strict;
use warnings;
use File::Slurp qw(read_file);

my %para = map { $_ => Read_paragraphs("$_.txt") } qw(FR EN);

my $query = 'La petite ville de';
my @matches = 
    map  { $para{EN}[$_] }
    grep { $para{FR}[$_] =~ /$query/ }
    0 .. @{$para{FR}} - 1
;

print $_, "\n" for @matches;

sub Read_paragraphs {
    return [split /\n{2,}/, read_file(shift)];
}

FM 2009-10-17 14:40:28

Why are you using `join()` to slurp a file? `read_file()` will return the entire file as a scalar if you just say `$content = read_file(shift);` and it'll be more efficient because it won't have to split the file up and then rejoin the pieces.

Chris Lutz 2009-10-17 19:57:13

@Chris Good to know. Thanks.

FM 2009-10-17 21:27:48

@FM, thanks for sharing the code! I changed qw(FR EN) to qw(c:/FR c:/EN) and then ran the code, but it gave me an error saying "Can't use an undefined value as an ARRAY reference at c:/test.pl line 11."

Mike 2009-10-18 06:24:04

@Mike `FR` and `EN` are the two hash keys. If you change them in one spot, you'll need to change them everywhere. Rather than doing that, perhaps a simpler way to make the adjustment you need is like this: `Read_paragraphs("C:/$_.txt") } qw(FR EN)`. That way, the hash keys remain short, and the full path information is used only where it's needed (the subroutine call to read the files).

FM 2009-10-18 13:35:29

@FM, thanks for the code and the explanation :)

Mike 2009-10-18 14:34:59
