Here's the situation:

I've got two versions of a novel, both in txt format. One is in the original language and the other is a Chinese or English translation. When reading the original version, I sometimes want to take a quick look at the translation of a particular sentence. What I expect is this: when I type that sentence in its original language, the corresponding sentence from the translated version appears on screen.

Here's my approach:

My original thinking was that since Perl knows the position of the line that matches the query (#learnt this from Chris' solution to my second post), all I need to do is let Perl use that position information to display the content of another file. But then I realized that shifting from one language to another is way more complicated. A single line of content in one language may turn out to be two or even three lines in another language, and the difference will build up. Then I figured brian's solution to my third question might be useful again. One paragraph of content in one language is likely to be contained in exactly one paragraph when translated, so I can just let Perl treat a paragraph as a line. I've now come up with the following code.

Here's my code:

#! perl

use warnings; use strict; 
use autodie; 
my $n;

my $file1 = "c:/FR.txt";
my $file2 = "c:/EN.txt";

print "INPUT YOUR QUERY:";
chomp(my $query=<STDIN>);

open my $fr,'<', $file1;
{ local $/="\n\n"; #learnt from brian's solution to [my 3rd question][1]

  my @fr = <$fr>;
  close $fr;

  for (0 .. $#fr) {   #learnt from Chris' solution to [my 2nd question][2]

    if ($fr[$_] =~ /$query/i){
      $n = $_;
    }
  }
}

open my $eng,'<',$file2;
{ local $/="\n\n";
  my @eng = <$eng>;
  close $eng;
  print $eng[$n];
}

Questions are here:

1: Is this a good approach to the problem?

2: When no match is found, I receive a warning message saying something like "Use of uninitialized value" etc. Well, it's technical and yes, I know what it means. But is it possible to change this message to something like "Oops, no match is found"?

The test files are something like:

file1

Chapitre premier

Une petite ville

La petite ville de Verrières peut passer pour l’une des plus jolies de la 

Franche-Comté....Espagnols, et maintenant ruinées.

Verrières est abrité du ... 
depuis la chute de Napoléon
 ...de presque toutes les maisons de Verrières.

à peine entre-t-on dans la ville ...
...
Eh ! elle est à M. le maire.

file2

CHAPTER 1

A Small Town

The small town of Verrieres may be regarded as one of the most
attractive....and now in
ruins.

Verrieres is sheltered ... since the fall of Napoleon, has led to the refacing
of almost all the houses in Verrieres.

No sooner has one entered the town ...Eh! It belongs to the Mayor.

If "La petite ville de" is searched, the output on screen should be:

The small town of Verrieres may be regarded as one of the most
attractive....and now in
ruins.

Thanks like always for any comments whatsoever :)

UPDATE1

Thanks for all the help!

Now question 2 can be solved with a few minor modifications like Chris has suggested:

if(defined $n) {
  open my $eng,'<',$file2;
  { local $/="\n\n";
    my @eng = <$eng>;
    close $eng;
    print $eng[$n];
  }
} else {
  print "Oops, no match found!\n";
}

UPDATE2

Chris' code should run much faster than mine when dealing with a huge file.

+2  A: 

(This is an answer to part 1 of the question only)

I've actually made a working "translated text search". I just used percentage offsets into the file. This worked for short texts but quickly broke down once the text was of any real length.

my $offset = $offset_of_passage_in_text1 * length ($text2)/length ($text1);

The margin of error compared to the length of the text gets bigger and bigger. For a whole book, I don't think that approach has much hope.
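For anyone unsure what a percentage offset looks like in practice, here is a minimal sketch of the idea. It assumes both texts are already held in strings; the sub name `estimate_offset`, the sample strings, and the 40-character window are made up for illustration and are not from my original program.

```perl
use strict;
use warnings;

# Hypothetical helper: map a character offset in $text1 to the
# proportionally equivalent offset in $text2.
sub estimate_offset {
    my ($offset_in_text1, $text1, $text2) = @_;
    return int($offset_in_text1 * length($text2) / length($text1));
}

# Tiny made-up sample texts standing in for the two files.
my $text1 = "Une petite ville. Verrieres est abrite du cote du nord.";
my $text2 = "A small town. Verrieres is sheltered on the north side.";

# index() returns the position of the query, or -1 if it is absent.
my $pos = index($text1, "Verrieres");
if ($pos >= 0) {
    my $guess = estimate_offset($pos, $text1, $text2);
    # Show a window of the second text starting at the estimated position.
    print substr($text2, $guess, 40), "\n";
}
else {
    print "No match found!\n";
}
```

The estimate is only as good as the assumption that the two texts grow at the same rate, which is why the error accumulates over a long book.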

One suggestion is to send the second language text to Google translate or just bung it through some kind of s/(\w+)/$dictionary{$1}/ substitution, then search for key words in the translated text to locate the likely position of the translation.

Here is a rough sketch of the code to make this work

open my $dictionary_file, "<:utf8", "name_of_file_containing_English_and_Chinese"
    or die $!;
my %dictionary;
while (<$dictionary_file>) {
     my ($english, $chinese) = split;
     $dictionary{$english} = $chinese;
}
close $dictionary_file or die $!;
my $crude_translation = $english_text;
$crude_translation =~ s/(\w+)/$dictionary{$1}/g;

I haven't tested this. The last line doesn't attempt to catch errors caused by words which are not in the dictionary.
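One possible way to handle words that are missing from the dictionary is to leave them untranslated. The sketch below assumes that behaviour is acceptable; the sub name `crude_translate` and the two-entry dictionary are invented for the example, and lowercasing the lookup key is my own choice.

```perl
use strict;
use warnings;

# Made-up two-entry dictionary; a real one would be loaded from a file.
my %dictionary = (
    small => 'petite',
    town  => 'ville',
);

# Hypothetical helper: replace each word with its dictionary entry,
# leaving words that have no entry unchanged.
sub crude_translate {
    my ($text, $dict) = @_;
    # /e evaluates the replacement as Perl code for every matched word.
    $text =~ s/(\w+)/exists $dict->{lc $1} ? $dict->{lc $1} : $1/ge;
    return $text;
}

print crude_translate("A small town", \%dictionary), "\n";
```

Because unknown words pass through untouched, the substitution no longer produces "uninitialized value" warnings for them.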

Kinopiko
Thanks Kinopiko. I'm totally clueless as how to use percentage offsets in Perl. After all I'm a beginner. But I see your point. And I'll try to improve the stock of my Perl knowledge. I also thought of using some kind of keyword translation first and then searching for the possible corresponding sentence. But again at this point it seems quite beyond my ability.
Mike
I've added some example code.
Kinopiko
@Kinopiko, thanks a lot for the example code. This shows me the direction :)
Mike
+2  A: 

To avoid the warning, you have to check whether or not $n is defined():

if(defined $n) {
  open my $eng,'<',$file2;
  { local $/="\n\n";
    <$eng> while --$n;
    print scalar <$eng>;
    close $eng;
  }
} else {
  print "No match found!\n";
}

I also rewrote the part that reads English. Rather than reading the entire file in and only using one line of it, it reads in $n - 1 lines and throws them away, and then prints the next line (for real this time) it reads. This should have the same effect, but with a lower memory impact on large files. (If it doesn't, it's probably an off-by-one error because I'm tired.)
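The read-and-discard idea can be tried without touching real files. This is a rough sketch that uses an in-memory filehandle (opened on a reference to a string) in place of EN.txt; the sub name `nth_paragraph` and the sample text are assumptions for illustration, and it counts paragraphs from 1.

```perl
use strict;
use warnings;

# Hypothetical helper: return paragraph number $n (counting from 1)
# from an open filehandle, or undef if there are fewer paragraphs.
sub nth_paragraph {
    my ($fh, $n) = @_;
    local $/ = "\n\n";            # a "line" is now a paragraph
    for (2 .. $n) {               # read and discard $n - 1 paragraphs
        defined(my $skipped = <$fh>) or return;
    }
    return scalar <$fh>;
}

# In-memory stand-in for the English file.
my $english = "CHAPTER 1\n\nA Small Town\n\nThe small town of Verrieres\n";
open my $eng, '<', \$english or die "Cannot open in-memory file: $!";
print nth_paragraph($eng, 3);
close $eng;
```

Only one paragraph is held in memory at a time, which is the whole point of the rewrite.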

EDIT: It turns out this introduced a subtle bug: the skipping code above treats $n as a line number counting from 1, but your search code stores an array index counting from 0. Your code to find the matching line also slurps the whole file into an array before finding the index that matches. Let's convert it to read line-by-line, which fixes the numbering and avoids huge memory consumption issues:

open my $fr,'<', $file1;
{ local $/="\n\n"; 
  while(<$fr>) {
    $n = $. if /$query/i;
  }
}

I think you understand most of that: while(<$fr>) reads line-by-line from $fr and sets each line to $_ for the loop iteration, /$query/i will implicitly match against $_ (which is what we want), but you're probably curious about this little bugger: $n = $.. From perldoc perlvar:

  • HANDLE->input_line_number(EXPR)
  • $INPUT_LINE_NUMBER
  • $NR
  • $.

Current line number for the last filehandle accessed.

Each filehandle in Perl counts the number of lines that have been read from it. (Depending on the value of $/ , Perl's idea of what constitutes a line may not match yours.) When a line is read from a filehandle (via readline() or <> ), or when tell() or seek() is called on it, $. becomes an alias to the line counter for that filehandle.

You can adjust the counter by assigning to $. , but this will not actually move the seek pointer. Localizing $. will not localize the filehandle's line count. Instead, it will localize perl's notion of which filehandle $. is currently aliased to.

$. is reset when the filehandle is closed, but not when an open filehandle is reopened without an intervening close(). For more details, see "I/O Operators" in perlop. Because <> never does an explicit close, line numbers increase across ARGV files (but see examples in eof).

You can also use HANDLE->input_line_number(EXPR) to access the line counter for a given filehandle without having to worry about which handle you last accessed.

(Mnemonic: many programs use "." to mean the current line number.)

So if we found a match in your third paragraph, $. would be 3. As a general recommendation, read through the perlvar page every once in a while. There are some gems in there, and even if you don't understand what everything is for, you'll get it on a reread.
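To see $. counting paragraphs rather than physical lines, here is a small sketch run against an in-memory file. The sub name `find_paragraph_number` and the sample text are made up; unlike the loop above, it returns at the first match, and it wraps the query in \Q...\E so that it is matched as literal text. Both of those are my assumptions about what you want.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based number of the first paragraph
# that contains $query, or undef when nothing matches.
sub find_paragraph_number {
    my ($fh, $query) = @_;
    local $/ = "\n\n";                    # paragraph mode for this sub
    while (my $para = <$fh>) {
        # \Q...\E treats the query as literal text, not as a regex.
        return $. if $para =~ /\Q$query\E/i;
    }
    return;
}

# In-memory stand-in for the French file.
my $french = "Chapitre premier\n\nUne petite ville\n\nLa petite ville de Verrieres\n";
open my $fr, '<', \$french or die "Cannot open in-memory file: $!";
my $n = find_paragraph_number($fr, "La petite ville de");
close $fr;

print defined $n ? "Match in paragraph $n\n" : "No match found!\n";
```

The returned number is a line number in the $. sense, so it can be handed straight to code that skips $n - 1 paragraphs in the other file.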

However, the final thing I have to say is that mobrule's advice about explicitly storing paragraph information is probably the best way to go. I might shy away from a homemade format, but I understand if XML or something is a little too heavyweight for your purposes. (Just know that your purposes are likely to expand greatly if you're not careful.)

Chris Lutz
`print <$eng>` evaluates `<$eng>` in list context. Try `print scalar <$eng>`
mobrule
Crap! Brains not working tonight.
Chris Lutz
Chris, thanks for the fix :)
Mike
Now the technical warning is gone :)
Mike
@mobrule, thanks for the suggestion :)
Mike
Now I know how to get rid of the unfriendly warning message but have trouble making the improved code work like the way it should. I don't quite understand what the following line does: <$eng> while --$n
Mike
@Chris, these three lines "<$eng> while --$n; print scalar <$eng>; close $eng;" are not working. For example, using my example files posted above, when I search for "elle est", the output on screen is "Verrieres is sheltered ... in Verrieres", which is not right.
Mike
@Mike - What it does is create a loop. `while --$n` will subtract 1 from `$n` and then, if `$n` is not zero, will execute the loop body. The loop body is `<$eng>` which reads a line from the file and throws it away. The idea is that if `$n` is the line number you want read, we read `$n - 1` lines from the file, so that the _next_ line that we read is the line we really want. The problem is in the indexing code. The indexing code assigns an _array index_ to `$n` and this code treats it as a _line number_ (which most people consider as starting at 1). Updating answer.
Chris Lutz
@Chris, thanks for the clarifications and the fix. Yes, I see it was the indexing number that caused the problem.
Mike
+2  A: 
mobrule
@mobrule, thanks, I've just tested your fix to modify the warning message. It works fine except for one thing. When I applied Chris' fix, the warning message becomes exactly "Oops, no match is found!" but this code gives me something else. For example, using the example files I posted, if I search for "admin", I receive the warning message "Oops, no match is found. Chapter 1".
Mike
+1  A: 

Here's a different approach for you to consider:

use strict;
use warnings;
use File::Slurp qw(read_file);

my %para = map { $_ => Read_paragraphs("$_.txt") } qw(FR EN);

my $query = 'La petite ville de';
my @matches = 
    map  { $para{EN}[$_] }
    grep { $para{FR}[$_] =~ /$query/ }
    0 .. @{$para{FR}} - 1
;

print $_, "\n" for @matches;

sub Read_paragraphs {
    return [split /\n{2,}/, read_file(shift)];
}
FM
Why are you using `join()` to slurp a file? `read_file()` will return the entire file as a scalar if you just say `$content = read_file(shift);` and it'll be more efficient because it won't have to split the file up and then rejoin the pieces.
Chris Lutz
@Chris Good to know. Thanks.
FM
@FM, thanks for sharing the code! I changed qw(FR EN) to qw(c:/FR c:/EN) and then ran the code, but it gave me an error saying "Can't use an undefined value as an ARRAY reference at c:/test.pl line 11."
Mike
@Mike `FR` and `EN` are the two hash keys. If you change them in one spot, you'll need to change them everywhere. Rather than doing that, perhaps a simpler way to make the adjustment you need is like this: `Read_paragraphs("C:/$_.txt") } qw(FR EN)`. That way, the hash keys remain short, and the full path information is used only where it's needed (the subroutine call to read the files).
FM
@FM, thanks for the code and the explanation :)
Mike