ansaurus

Question

Answer 1

+2 A:

You should mention your actual Windows and Perl versions, as this really depends on the versions you use and the language packages you have installed.
Otherwise, have a look at the perlunicode manual first:

Perl uses logically-wide characters to represent strings internally.

It will confirm your statements.

Windows does not fully install all UTF-8 characters, so this might be the reason for your issue. You may need to install an additional language package.

weismat 2010-06-03 08:41:09

Your penultimate sentence makes no sense at all. You seem to refer to fonts, but this has nothing to do with encodings.

daxim 2010-06-03 16:30:57

Answer 2

+3 A:

Setting utf8 before reading from the file is good: it automagically decodes the bytes into the internal encoding. (Which is also UTF-8, but you don't need to know that and shouldn't rely on it.)

Before printing you need to encode the characters back to bytes.

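use Encode;  # provides encode() and decode(); utf8::encode itself is built in
utf8::encode($contents);  # converts $contents in place from characters to UTF-8 bytes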

There is also a two-argument form, encode($encoding, $string) from the Encode module, for encodings other than UTF-8. (That sentence echoes too much, doesn't it?)
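A minimal sketch of the decode/encode round trip with the Encode module. The sample bytes, the variable names, and the choice of cp1251 as the "other" encoding are assumptions for illustration; they are not taken from the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the UTF-8 bytes of the Cyrillic letter U+0416.
my $bytes = "\xD0\x96";

# decode() turns encoded bytes into a string of characters.
my $chars = decode('UTF-8', $bytes);

# encode() takes the encoding name first, then the character string.
my $utf8   = encode('UTF-8', $chars);    # back to two bytes
my $cp1251 = encode('cp1251', $chars);   # one byte in a single-byte charset

printf "%d character, %d UTF-8 bytes, %d cp1251 byte\n",
    length($chars), length($utf8), length($cp1251);
```

The point of the sketch is that length() counts characters on the decoded string and bytes on the encoded ones, which is why the two must not be mixed up.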

Here is a good reference. (There would have been more, but it's my first post.) Check out perlunitut too, and the Unicode article on Joel on Software.

http://www.ahinea.com/en/tech/perl-unicode-struggle.html

Oh, and it must use multi-byte strings, because otherwise it's just not Unicode.

dylan 2010-06-03 12:48:24

By multi-byte strings I meant variable-width encoding.

n0rd 2010-06-03 13:29:22

Anyway, I don't get why I have to do the conversion explicitly: I specified the input data encoding, so why do I have to take additional steps?

n0rd 2010-06-03 13:31:03

You've specified the input encoding. You do your stuff. Then you specify your output encoding. The articles I referred to explain better, I should think.
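The three steps can be sketched roughly as follows, assuming UTF-8 on both sides. In-memory filehandles stand in for real files so the sketch is self-contained; the sample word and all variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical input: the UTF-8 bytes of a three-letter Cyrillic word.
my $input_bytes = "\xD0\xB6\xD1\x83\xD0\xBA";

# 1. Specify the input encoding: the layer decodes bytes into characters.
open my $in, '<:encoding(UTF-8)', \$input_bytes or die "open input: $!";
my $text = <$in>;
close $in;

# 2. Do your stuff on characters, not bytes.
my $upper = uc $text;

# 3. Specify the output encoding: the layer encodes characters into bytes.
my $output_bytes = '';
open my $out, '>:encoding(UTF-8)', \$output_bytes or die "open output: $!";
print {$out} $upper;
close $out;

printf "%d characters in, %d bytes out\n",
    length($text), length($output_bytes);
```

Perl does not carry the input handle's encoding over to other handles, so the output side has to be told separately.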

dylan 2010-06-03 14:45:34

Do not use the functions from the `utf8` package. The docs say: **Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.** Instead always use the `Encode` module.

daxim 2010-06-03 16:37:04

Answer 3

+2 A:

Perl strings are stored internally in one of two encodings: either an 8-bit, byte-oriented native encoding, or UTF-8. For backwards compatibility, the assumption is that all I/O and strings are in the native encoding unless otherwise specified. The native encoding is usually an 8-bit superset of ASCII (typically ISO-8859-1), but this can be changed with use locale.

In your sample you call binmode on your input handle, changing it to use :utf8 semantics. One effect of this is that all strings read from this handle are decoded from UTF-8 into character strings (which Perl stores internally as UTF-8). print writes to STDOUT by default, and STDOUT defaults to expecting native-encoded characters.

Perl, in an attempt to do the right thing, will allow a UTF-8 string to be sent to a native-encoded output, but if there is no encoding attached to that handle, it has to guess how to output multi-byte characters, and it will almost certainly guess wrong. That is what the warning means: a multi-byte character was sent to a stream expecting only single-byte characters, and the character was probably damaged in translation.

Depending on what you want to accomplish, you have two options. You can use the Encode module mentioned by dylan to convert the UTF-8 data to a single-byte character set that can be printed safely. Or, if you know that whatever is attached to STDOUT can handle UTF-8, you can use binmode(STDOUT, ':utf8'); to tell Perl that any data sent to STDOUT should be sent as UTF-8.
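A minimal sketch of the second option, assuming that whatever reads STDOUT (terminal, pipe, or file) really expects UTF-8. The sample string is hypothetical, and ':encoding(UTF-8)' is used here as the stricter, validating variant of the ':utf8' layer.

```perl
use strict;
use warnings;

# Hypothetical data: a character string holding the Cyrillic letter U+0416.
my $string = "\x{416}";

# Attach an encoding layer to STDOUT. Characters are now encoded to
# UTF-8 bytes on the way out.
binmode(STDOUT, ':encoding(UTF-8)');

# With the layer in place, print raises no warning about wide characters.
print $string, "\n";
```

On a console that does not use UTF-8, the first option (encoding to the console's own character set with the Encode module) is the one to try instead.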

Ven'Tatsu 2010-06-03 15:55:18

If the default encoding were 8-bit ASCII (or any other 8-bit encoding), why does Perl print UTF-8 strings as raw bytes (i.e. two characters on the console for each Cyrillic character in the printed string) instead of transcoding them into that encoding, which would give exactly the same number of characters as in the original string?

n0rd 2010-06-03 17:38:48

@n0rd, from Perl's perspective a UTF-8 string is not bytes, it's characters. An odd result of this, IIRC, is that when such a string is printed to a handle with no encoding defined, Unicode code points greater than 255 are truncated to just the lower 8 bits.

Ven'Tatsu 2010-06-05 23:07:31


Perl strings internals
