ansaurus

Question

Why can't I use the map function to create a good hash from a simple data file in Perl?

Answer 1

A:

Use `split /\s/` instead of `split /\t/`.

apbianco 2009-11-19 12:42:03

`split /\s/` is different than `split ' '`. Unfortunately, many people use the former rather than the latter when they mean the latter.

Sinan Ünür 2009-11-19 12:45:26
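
To make that difference concrete, here is a minimal sketch. The sample line is hypothetical and is not taken from the asker's data file. `split /\s/` splits on every single whitespace character and keeps leading empty fields, whereas `split ' '` skips leading whitespace and treats a run of whitespace as one separator.

```perl
use strict;
use warnings;

# Hypothetical input: two leading spaces, then a key, two tabs, and a value.
my $line = "  abacus\t\tvalue";

# Splits at each individual whitespace character, so empty fields appear.
my @by_regex = split /\s/, $line;

# Special case: behaves like awk, ignoring leading whitespace and
# collapsing consecutive whitespace into a single separator.
my @by_space = split ' ', $line;

printf "split /\\s/ gives %d fields\n", scalar @by_regex;
printf "split ' '  gives %d fields\n", scalar @by_space;
```

With the pattern form, the key would land in the third field here rather than the first, which is the kind of surprise the comment above warns about.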

@apbianco, the problem still persists.

Mike 2009-11-19 12:54:23

@all, it's really frustrating :( Maybe it's because I'm running on Windows XP (Chinese version) and there's some encoding incompatibility? But I already took the precaution of saving the data file as UTF-8.

Mike 2009-11-19 12:56:41

Wait: 'abacus' => 'æbәkәs ' -- did you cut'n'paste or did you type the output? Is that a space at the end of `'æbәkәs '`?

apbianco 2009-11-19 13:15:21

@apbianco Actually, the crucial one that would give rise to the warning is the extra space character in `' abacus'`. It was not visible before I reformatted his post to use `<pre>`.

Sinan Ünür 2009-11-19 13:19:16

Sinan: "different from".

Svante 2009-11-19 13:40:38

@Svante Thank you for the correction. Editing in the comment box is error-prone and I forgot to correct that. Also, as a humorous side note *A few malcontents will have none of this, claiming that in the UK it's considered perfectly proper to use different than in a prepositional construction. So?* http://www.straightdope.com/columns/read/2295/is-different-than-bad-grammar One of my English teachers in high school was an Oxford grad. I blame it on him.

Sinan Ünür 2009-11-19 21:06:12

Answer 2

+6 A:

Sinan Ünür 2009-11-19 13:03:55

@Sinan, thanks. But the problem persists and the warning message remains the same. As far as I can observe on my system, when the data file is encoded as UTF-8 and the Perl script is also saved as UTF-8, I don't have to use the "<:utf8" format.

Mike 2009-11-19 13:14:45

@Sinan, to solve this problem of the missing first-line entry, it seems that I have to add an empty first line consisting of a whitespace, a tab, another whitespace and a \n.

Mike 2009-11-19 13:16:39

@Sinan, the culprit IS the system. I saved the script, the data file and the output file all in the system default encoding, GB2312, and although some characters won't display properly, all the hash elements are there.

Mike 2009-11-19 13:21:24

@Sinan, thanks! Yes, it is "\x{feff}" that causes the problem. It is inserted at the start of the first line to indicate that the data is UTF encoded, but for some reason my OS does not treat it the right way. I'm pretty sure that I've done everything necessary to ensure that the data is read as UTF-8.

Mike 2009-11-19 15:23:00

Notepad adds the BOM when you save as UTF-8. To read that properly you need to open with `'<:utf8'`. That's 100% as it should be and there's no "problem".

hobbs 2009-11-19 20:29:02
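
A note on why the layer alone may not be enough: the UTF-8 layer decodes the three BOM bytes into the single character U+FEFF but does not remove it, so the first key would become "\x{FEFF}abacus". The sketch below strips that character from the first line before splitting. The in-memory file contents and the placeholder value are assumptions made for illustration, not the asker's real data.

```perl
use strict;
use warnings;

# Hypothetical file contents: BOM bytes (EF BB BF) followed by two entries.
my $bytes = "\xEF\xBB\xBFabacus\tfirst value\nabalone\tsecond value\n";

# Read the byte string through the same kind of layer used for a real file.
open my $in, '<:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory file: $!";

my %hash;
while (my $line = <$in>) {
    # The decoded BOM can only appear at the very start of the first line.
    $line =~ s/\A\x{FEFF}// if $. == 1;
    chomp $line;
    my ($key, $value) = split ' ', $line, 2;
    $hash{$key} = $value if defined $key;   # skip blank lines
}
close $in;

print "$_ => $hash{$_}\n" for sort keys %hash;
```

After this, `$hash{abacus}` should be reachable under its plain name, because the key no longer carries the invisible prefix.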

@hobbs, thanks. But the thing is adding '<:utf8' does not solve my problem. Please read my updated post. Thanks.

Mike 2009-11-20 00:58:26

@Sinan, thanks again. I've solved the problem. Please kindly read my updated post. Thanks :)

Mike 2009-11-20 00:59:28

Answer 3

A:

Works For Me. Are you sure your example matches your actual code and data?

Sorpigal 2009-11-19 13:09:05

@Sorpigal, Thanks for trying to be helpful. I suppose this answer was downvoted because it was more like a comment than an answer.

Mike 2009-11-21 11:03:02

Answer 4

+2 A:

If you want to read/write UTF8 files, you should make sure that you are actually reading them in as UTF8.

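#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;

# Decode both files as UTF-8; stop early if either cannot be opened.
open my $in,  '<:utf8', "hash_test.txt"   or die "Cannot read hash_test.txt: $!";
open my $out, '>:utf8', "hash_result.txt" or die "Cannot write hash_result.txt: $!";

# Each line is "key<whitespace>value"; the limit of 2 keeps spaces inside the value.
my %hash = map { chomp; split ' ', $_, 2 } <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}\n";
print $out "$hash{abalone}\n";
print $out "$hash{abandon}\n";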

If you want it to be more robust, it is recommended to use :encoding(utf8) instead of :utf8, for reading a file.

open my $in, '<:encoding(utf8)', "hash_test.txt";

Read PerlIO for more information.

Brad Gilbert 2009-11-19 14:25:44
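
One way to read "more robust" here is stricter validation of the input bytes. The following is only a sketch of that idea using the core `Encode` module rather than a file layer. The two byte strings are hypothetical samples. In `Encode`, the name `UTF-8` (with the hyphen) selects strict checking, while `utf8` is Perl's laxer variant.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical byte strings: one is valid UTF-8, the other contains the
# byte 0xFF, which can never appear in well-formed UTF-8.
my %samples = (
    good => "abacus\t\xC3\xA6b",
    bad  => "abacus\t\xFFb",
);

my %is_valid;
for my $name (sort keys %samples) {
    my $bytes = $samples{$name};   # work on a copy; decode may modify its input
    # FB_CROAK makes decode die on malformed input instead of substituting.
    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    $is_valid{$name} = defined $text ? 1 : 0;
    print "$name: ", ($is_valid{$name} ? "valid UTF-8" : "malformed"), "\n";
}
```

A check like this can help answer whether a file really is UTF-8 before blaming the hash-building code.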

@Brad, thanks. But I've already tried all these UTF-8 configurations, and they do not seem to help.

Mike 2009-11-19 15:26:30

Are you sure that the original text is in UTF8?

Brad Gilbert 2009-11-19 16:53:53

@Brad, yes, I'm 100% sure that the original text is in UTF8. Without UTF-8 encoding, my OS simply won't display those characters properly. Please kindly read my updated post.

Mike 2009-11-20 01:01:01

Answer 5

+1 A:

I think your answer may be sitting right in front of you. The output from Data::Dumper which you posted is:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };

Notice the character between the ' and abacus? You tried to access the third value via $hash{abacus}. This is incorrect because of that character before abacus in the Dumper() hash. You could try plugging it into a loop which should take care of it:

foreach my $k (keys %hash) {
  print $out $hash{$k};
}

Jack M. 2009-11-19 14:57:44
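
If the stray character is hard to spot, one option is to make it visible. The sketch below assumes the hidden character is a byte order mark (U+FEFF), as reported elsewhere in this thread, and uses a hypothetical two-entry hash. Setting `$Data::Dumper::Useqq` makes `Dumper` print strings in double quotes with escapes instead of raw characters.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq    = 1;   # escape non-printable and wide characters
$Data::Dumper::Sortkeys = 1;   # stable output order

# Hypothetical hash whose first key carries an invisible U+FEFF prefix.
my %hash = (
    "\x{FEFF}abacus" => "first value",
    "abalone"        => "second value",
);

print Dumper(\%hash);

# List every key containing anything outside printable ASCII.
my @suspect = grep { /[^\x20-\x7E]/ } keys %hash;
printf "%d suspicious key(s) found\n", scalar @suspect;
```

Looping over `keys %hash` prints every value regardless of hidden characters, but a check like this shows why a direct lookup such as `$hash{abacus}` comes back undefined.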
