The post is updated. Please kindly jump to the Solution part, if you've already read the posted question. Thanks!

Here's the minimized code to exhibit my problem:

The input data file for the test was saved as UTF-8 with Windows' built-in Notepad. It has the following three lines:

abacus  æbәkәs
abalone æbәlәuni
abandon әbændәn

The Perl script file was also saved as UTF-8 with Windows' built-in Notepad. It contains the following code:

#!perl -w

use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";

my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}";
print $out "$hash{abalone}";
print $out "$hash{abandon}";

In the output, the hash table seems to be okay:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };

But it is actually not, because I only get two values instead of three:

æbәlәuni
әbændәn

Perl gives the following warning message:

Use of uninitialized value $hash{"abacus"} in string at C:\test2.pl line 11, <$in> line 3.

Where's the problem? Can someone kindly explain? Thanks.

The Solution

Many thanks to all of you :) The culprit has finally been found and the problem is fixable :) As @Sinan insightfully pointed out, the cause of the problem described above is the three-byte BOM (EF BB BF) that Notepad prepends to a file when saving it as UTF-8, and which Perl does not strip on its own. The BOM ends up glued to the first key, so the lookup for "abacus" finds nothing. Although many suggested that I use "<:utf8" and ">:utf8" to read and write the files, these layers alone do not solve the problem: the BOM is still read as part of the first key (see the Note below). Instead, they may cause other problems.

To really solve the problem, all I actually need is to add one line of code to force Perl to ignore the BOM:

#!perl -w

use Data::Dumper;
use strict;
use autodie;

open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";

seek $in,3,0; # force Perl to ignore the BOM!
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};

Now, the output is exactly what I expected:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };
æbәkәs
æbәlәuni
әbændәn

Please note that the script is saved as UTF-8 and the code does not have to include any utf-8 layers, because the input file is already saved as UTF-8 and its bytes are copied unchanged to the output file.
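The unconditional seek above assumes the file always starts with a BOM; on a file saved without one it would silently drop the first three bytes of real data. A minimal sketch of a more defensive variant, assuming a hypothetical helper named skip_utf8_bom (not part of the original script) and in-memory files standing in for hash_test.txt:

```perl
use strict;
use warnings;

# Hypothetical helper: skip a UTF-8 BOM only if the file really starts with one.
sub skip_utf8_bom {
    my ($fh) = @_;
    my $count = read($fh, my $head, 3);    # peek at the first three bytes
    my $has_bom = defined $count && $count == 3 && $head eq "\xEF\xBB\xBF";
    seek($fh, 0, 0) unless $has_bom;       # no BOM: rewind to the start
    return;
}

# In-memory stand-ins for the data file, one with a BOM and one without.
my $with_bom    = "\xEF\xBB\xBFabacus\tx\n";
my $without_bom = "abacus\tx\n";

for my $data ($with_bom, $without_bom) {
    open my $in, '<', \$data or die "Cannot open in-memory file: $!";
    skip_utf8_bom($in);
    my %hash = map { chomp; split /\t/, $_, 2 } <$in>;
    print exists $hash{abacus} ? "found abacus\n" : "missing abacus\n";
}
```

Both passes should find the abacus key, whether or not the BOM is present.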

Finally, thanks again to all of you. And thank you, @Sinan, for the insightful guidance. Without your help, I would have stayed in the dark for God knows how long.

Note: To clarify a little more, if I use:

open my $in,'<:utf8',"./hash_test.txt";
open my $out,'>:utf8',"./hash_result.txt";

my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};

The output is this:

$VAR1 = {
          'abalone' => "\x{e6}b\x{4d9}l\x{4d9}uni
",
          'abandon' => "\x{4d9}b\x{e6}nd\x{4d9}n",
          "\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"
        };
æbәlәuni
әbændәn

And the warning message:

Use of uninitialized value in print at C:\hash_test.pl line 13, <$in> line 3.
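In the output above, the decoding layer has turned the BOM into the single character U+FEFF, which is still attached to the first key. One possible way to combine a decoding layer with BOM removal is to strip that character from the first line after reading. This is only a sketch: read_pairs is a hypothetical helper, and the in-memory data stands in for hash_test.txt.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build the key/value list from an already-decoded
# filehandle, dropping a leading U+FEFF if one is present.
sub read_pairs {
    my ($fh) = @_;
    my @lines = <$fh>;
    $lines[0] =~ s/\A\x{FEFF}// if @lines;   # remove the decoded BOM
    chomp @lines;
    return map { split /\t/, $_, 2 } @lines;
}

# In-memory stand-in for the data file: BOM followed by UTF-8 encoded text.
my $bytes = encode('UTF-8',
    "\x{FEFF}abacus\t\x{e6}b\x{4d9}k\x{4d9}s\nabandon\t\x{4d9}b\x{e6}nd\x{4d9}n\n");

open my $in, '<:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory file: $!";
my %hash = read_pairs($in);

print exists $hash{abacus} ? "abacus key is intact\n"
                           : "abacus key is still prefixed\n";
```

With the first line cleaned this way, $hash{abacus} resolves normally and the values are proper character strings, so the output handle would also need an encoding layer when they are printed.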
A: 

split/\s/ instead of split/\t/

apbianco
`split /\s/` is different than `split ' '`. Unfortunately, many people use the former rather than the latter when they mean the latter.
Sinan Ünür
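To illustrate the difference mentioned in the comment above, here is a small sketch on a made-up line with leading and doubled spaces (the sample string is an assumption, not taken from the data file):

```perl
use strict;
use warnings;

my $line = "  abacus  x";   # two leading spaces, two spaces between fields

# split ' ' skips leading whitespace and treats runs of whitespace as one separator.
my @awk_mode = split ' ', $line;

# split /\s/ treats every single whitespace character as a separator,
# so leading and doubled spaces produce empty fields.
my @regex = split /\s/, $line;

print scalar(@awk_mode), " fields with split ' '\n";
print scalar(@regex),    " fields with split /\\s/\n";
```

Here split ' ' yields 2 fields, while split /\s/ yields 5, three of them empty.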
@apbianco, the problem still persists.
Mike
@all, it's really frustrating :( Maybe it's because I'm running Windows XP (Chinese version) and there's some encoding incompatibility? But I already took the precaution of saving the data file as UTF-8.
Mike
Wait: 'abacus' => 'æbәkәs ' -- did you cut'n'paste or did you type the output? Is that a space at the end of `æbәkәs`?
apbianco
@apbianco Actually, the crucial one that would give rise to the warning is the extra space character in `' abacus'`. It was not visible before I reformatted his post to use `<pre>`.
Sinan Ünür
Sinan: "different from".
Svante
@Svante Thank you for the correction. Editing in the comment box is error-prone and I forgot to correct that. Also, as a humorous side note *A few malcontents will have none of this, claiming that in the UK it's considered perfectly proper to use different than in a prepositional construction. So?* http://www.straightdope.com/columns/read/2295/is-different-than-bad-grammar One of my English teachers in high school was an Oxford grad. I blame it on him.
Sinan Ünür
+6  A: 
Sinan Ünür
@Sinan, thanks. But the problem persists and the warning message remains the same. As far as I can see, on my system, when the data file is encoded as UTF-8 and the Perl script is also saved as UTF-8, I don't have to use the "<:utf8" format.
Mike
@Sinan, to solve this first line entry missing problem, it seems that I have to add an empty line with a whitespace and a tab and another whitespace and a \n.
Mike
@Sinan, the culprit IS the system. I saved the script and the data file and the output file all as the system default encoding, "GB2312" and although some characters won't display properly, all the hash elements are there.
Mike
@Sinan, thanks! Yes, it is "\x{feff}" that causes the problem. It is inserted at the start of the first line to indicate that the data is UTF encoded, but for some reason my OS does not treat it the right way. I'm pretty sure that I've done everything necessary to ensure that the data is read as UTF-8.
Mike
Notepad adds the BOM when you save as UTF-8. To read that properly you need to open with `'<:utf8'`. That's 100% as it should be and there's no "problem".
hobbs
@hobbs, thanks. But the thing is adding '<:utf8' does not solve my problem. Please read my updated post. Thanks.
Mike
@Sinan, thanks again. I've solved the problem. Please kindly read my updated post. Thanks :)
Mike
A: 

Works For Me. Are you sure your example matches your actual code and data?

Sorpigal
@Sorpigal, Thanks for trying to be helpful. I suppose this answer was downvoted because it was more like a comment than an answer.
Mike
+2  A: 

If you want to read/write UTF8 files, you should make sure that you are actually reading them in as UTF8.

#! /usr/bin/env perl
use Data::Dumper;
open my $in,  '<:utf8', "hash_test.txt";
open my $out, '>:utf8', "hash_result.txt";

my %hash = map { chomp; split ' ', $_, 2 } <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}\n";
print $out "$hash{abalone}\n";
print $out "$hash{abandon}\n";

If you want it to be more robust, it is recommended to use :encoding(utf8) instead of :utf8, for reading a file.

open my $in, '<:encoding(utf8)', "hash_test.txt";

Read PerlIO for more information.

Brad Gilbert
@Brad, thanks. But I've already tried all these utf-8 configurations, they do not seem to be of help.
Mike
Are you sure that the original text is in UTF8?
Brad Gilbert
@Brad, yes, I'm 100% sure that the original text is in UTF8. Without UTF-8 encoding, my OS simply won't display those characters properly. Please kindly read my updated post.
Mike
+1  A: 

I think your answer may be sitting right in front of you. The output from Data::Dumper which you posted is:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };

Notice the character between the ' and abacus? You tried to access the third value via $hash{abacus}. This is incorrect because of that character before abacus in the Dumper() hash. You could try plugging it into a loop which should take care of it:

foreach my $k (keys %hash) {
  print $out $hash{$k};
}
Jack M.
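An invisible character in a key, as described in the answer above, is easier to spot when it is printed as an escape rather than as raw bytes. A minimal sketch, assuming stand-in data with the raw BOM bytes in front of the first key and a hypothetical hex_bytes helper:

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq    = 1;   # show non-printable bytes as escapes
$Data::Dumper::Sortkeys = 1;   # stable key order for easier comparison

# Hypothetical helper: render a string as space-separated hex bytes.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $s;
}

# Stand-in data: the first key carries the raw UTF-8 BOM bytes.
my %hash = ("\xEF\xBB\xBFabacus" => 'x', 'abandon' => 'y');

print Dumper(\%hash);                              # keys with escapes visible
print hex_bytes($_), "\n" for sort keys %hash;     # one hex dump per key
```

In the hex dump, the affected key starts with ef bb bf before the letters of abacus, which explains why $hash{abacus} finds nothing.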