How are Perl strings represented internally? What encoding is used? How do I handle different encodings properly?

I've been using Perl for quite a long time, but my work didn't involve much string handling in different encodings, and whenever I ran into a minor encoding problem I usually resorted to some shamanic actions.

Until now I thought of Perl strings as sequences of bytes, which fit my tasks pretty well. Now I need to process a UTF-8 encoded file, and here the trouble starts.

First, I read the file into a string like this:

open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading";
binmode($in, ':utf8');

my $contents;

{
    local $/;
    $contents = <$in>;
}

close($in);

then simply print it:

print $contents;

And I get two things: a warning, Wide character in print at <scriptname> line <n>, and garbage in the console. So I can conclude that Perl strings have a concept of a "character" that can be "wide" or not, but when printed these "wide" characters are represented in the console as multiple bytes, not as a single "character". (I now wonder why all my previous experience with binary files worked just as I expected, without any "character" issues.)

Why then do I see garbage in the console? If Perl stores strings as characters in some known encoding, I don't think it would be a big problem to find out the console encoding and print the text properly. (I use Windows, BTW.)

If Perl stores strings as variable-width character sequences (e.g. using the same UTF-8 encoding), why is it done this way? From my C experience, handling such strings is a PAIN.

Update.

I use two computers for testing. One runs Windows 7 x64 with the English language pack installed but with Russian regional settings (so I have cp866 as the OEM codepage and cp1251 as the ANSI one), with ActivePerl 5.10.1 x64. The other runs Windows XP 32-bit, Russian localization, with Cygwin Perl 5.10.0.

Thanks to the links, I now have a much more solid understanding of what's going on and how things should be done.

+2  A: 

You should mention your actual Windows and Perl versions, as this really depends on the versions you use and the language packs installed.
Otherwise, have a look at the perlunicode manual first -

Perl uses logically-wide characters to represent strings internally.

it will confirm your statements.

Windows does not fully install support for all UTF-8 characters, so this might be the reason for your issue. You may need to install an additional language pack.

weismat
Your penultimate sentence makes no sense at all. You seem to refer to fonts, but this has nothing to do with encodings.
daxim
+3  A: 

Setting :utf8 on the handle before reading from the file is good; it automagically decodes the bytes into the internal encoding. (Which is also UTF-8, but you don't need to know that, and shouldn't rely on it.)

Before printing you need to encode the characters back to bytes.

use Encode;  
utf8::encode($contents);

There is also a two-argument form of encode, for encodings other than Unicode. (That sentence echoes too much, doesn't it?)
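A minimal sketch of that two-argument form, using the Encode module's decode and encode functions. The sample bytes and the choice of cp866 as the target are only illustrative assumptions (cp866 being the OEM codepage mentioned in the question); substitute whatever encoding your output really expects.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for the Russian word "привет", as they would arrive
# from a file opened without any decoding layer.
my $bytes = "\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82";

# Bytes -> characters.
my $chars = decode('UTF-8', $bytes);

# Characters -> bytes, in whichever encoding the output expects.
my $utf8_out  = encode('UTF-8', $chars);
my $cp866_out = encode('cp866', $chars);

printf "%d characters, %d UTF-8 bytes, %d cp866 bytes\n",
    length($chars), length($utf8_out), length($cp866_out);
# prints "6 characters, 12 UTF-8 bytes, 6 cp866 bytes"
```

The point is that the character count stays the same while the byte count depends on the target encoding.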

Here is a good reference. (Would have been more, but it's my first post.) Check out perlunitut too, and the unicode article on Joel on Software.

http://www.ahinea.com/en/tech/perl-unicode-struggle.html

Oh, and it must use multi-byte strings, because otherwise it's just not unicode.

dylan
By multi-byte strings I meant variable-width encoding.
n0rd
Anyway, I don't get why I have to do the conversion explicitly: I specified the input data encoding, so why do I have to take additional steps?
n0rd
You've specified the input encoding. You do your stuff. Then you specify your output encoding. The articles I referred to explain better, I should think.
dylan
Do not use the functions from the `utf8` package. The docs say: **Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.** Instead always use the `Encode` module.
daxim
+2  A: 

Perl strings are stored internally in one of two encodings: either an 8-bit, byte-oriented native encoding, or UTF-8. For backwards compatibility the assumption is that all I/O and strings are in the native encoding, unless otherwise specified. The native encoding is usually an 8-bit superset of ASCII (Latin-1 on most platforms), and locale settings (use locale) can affect how such strings are treated.
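To make the two storage forms concrete, here is a small sketch that peeks at the internal flag. It uses utf8::upgrade and utf8::is_utf8, which are built into Perl; they appear here purely for inspection, and ordinary code should not need to care which form a string is in.

```perl
use strict;
use warnings;

my $native   = "caf\xE9";     # 4 characters, stored in the 8-bit native form
my $upgraded = $native;
utf8::upgrade($upgraded);     # same characters, now stored as UTF-8 internally

print "equal: ", ($native eq $upgraded ? "yes" : "no"), "\n";
print "lengths: ", length($native), " and ", length($upgraded), "\n";
print "internal UTF-8 flag: ",
    (utf8::is_utf8($native)   ? 1 : 0), " and ",
    (utf8::is_utf8($upgraded) ? 1 : 0), "\n";
# prints:
# equal: yes
# lengths: 4 and 4
# internal UTF-8 flag: 0 and 1
```

Both strings behave identically at the Perl level; only the storage differs.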

In your sample you call binmode on your input handle changing it to use :utf8 semantics. One effect of this is that all strings read from this handle will be encoded as UTF-8. print writes to STDOUT by default, and STDOUT defaults to expecting native encoded characters.

Perl, in an attempt to do the right thing, will allow a UTF-8 string to be sent to a native-encoded output, but if there is no encoding attached to that handle then it has to guess how to output multi-byte characters, and it will almost certainly guess wrong. That is what the warning means: a multi-byte character was sent to a stream expecting only single-byte characters, and the result is that the character was probably damaged in translation.
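One way to see what actually reaches the handle is to print a wide character to an in-memory file handle that has no encoding layer, and trap the warning. This is only a sketch: the in-memory handle stands in for STDOUT, and the variable names are made up for the illustration.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# In-memory handle with no encoding layer, standing in for STDOUT.
my $sink = '';
open(my $out, '>', \$sink) or die "cannot open in-memory handle: $!";
print {$out} "\x{43F}";       # Cyrillic "п", a code point above 255
close($out);

print "warning seen: ", (@warnings ? "yes" : "no"), "\n";
printf "bytes written: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, $sink;
```

On the Perl versions I know of, this reports the "Wide character in print" warning and shows the two bytes of the character's UTF-8 form, which matches the two-bytes-per-Cyrillic-letter garbage described in the question.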

Depending on what you want to accomplish, you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single-byte character set that can be printed safely. Or, if you know that whatever is attached to STDOUT can handle UTF-8, you can use binmode(STDOUT, ':utf8'); to tell Perl that any data sent to STDOUT should be sent as UTF-8.
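As a rough sketch of doing both ends with I/O layers instead of explicit encode calls: the example below uses in-memory handles as stand-ins for the input file and the console, and assumes cp866 as the console encoding (taken from the question's setup). In a real script the output side would be binmode(STDOUT, ':encoding(cp866)') or whatever encoding your console actually uses.

```perl
use strict;
use warnings;

# Stand-in for a UTF-8 file on disk: raw bytes for "да\n".
my $file_bytes = "\xD0\xB4\xD0\xB0\n";

# Decode on the way in.
open(my $in, '<:encoding(UTF-8)', \$file_bytes) or die "cannot open input: $!";
my $contents = do { local $/; <$in> };
close($in);

# Encode on the way out; this handle plays the role of the console.
my $console_bytes = '';
open(my $out, '>:encoding(cp866)', \$console_bytes) or die "cannot open output: $!";
print {$out} $contents;
close($out);

printf "read %d characters, wrote %d bytes\n",
    length($contents), length($console_bytes);
```

With a layer on the output handle, no "Wide character" warning should appear, because Perl no longer has to guess how to turn characters into bytes.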

Ven'Tatsu
If the default encoding is 8-bit ASCII (or any other 8-bit encoding), why does Perl print UTF-8 strings as raw bytes (i.e. two characters on the console for each Cyrillic character in the printed string) instead of transcoding into that encoding, which would give exactly the same number of characters as in the original string?
n0rd
@n0rd A UTF-8 string is not bytes from Perl's perspective, it's characters. An odd result of this, IIRC, is that when it is printed to a handle with no encoding defined, the Unicode code points greater than 255 get truncated to just their lower 8 bits.
Ven'Tatsu