The utf8 pragma and utf8 encodings on filehandles have me confused. For example, this apparently straightforward code...

use utf8;
print qq[fü];

To be clear, the hex dump of "fü" in the source file is 66 c3 bc, which if I'm not mistaken is proper UTF-8.

The code above prints 66 fc, which is not UTF-8 but the bare Unicode code point, or maybe Latin-1. Turn off use utf8 and I get 66 c3 bc. This is the opposite of what I'd expect.

Now let's add in filehandle pragmas.

use utf8;
binmode *STDOUT, ':encoding(utf8)';
print qq[fü];

Now I get 66 c3 bc. But remove use utf8 and I get 66 c3 83 c2 bc which doesn't make any sense to me.

What's the right thing to do to make my code DWIM with UTF8?

PS My locale is set to "en_US.UTF-8" and Perl 5.10.1.

+6  A: 

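use utf8; states that your source code is encoded in UTF-8, so Perl decodes the literal "fü" into two characters (U+0066, U+00FC) instead of treating it as three separate bytes. By adding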

binmode *STDOUT, ':encoding(utf8)';
print qq[fü];

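you are asking that the script's output be encoded in UTF-8 as well. Without that layer, Perl writes the character U+00FC as the single byte fc, which is the 66 fc you saw.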

If you had written

print "f\x{00FC}\n";

you would not have needed use utf8; at all, because the source would then contain only ASCII.
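As a rough illustration of where each byte sequence in the question comes from, the sketch below reproduces them with the core Encode module. It uses escapes rather than a literal "fü" so the result does not depend on how the file is saved; hex_bytes is a hypothetical helper written only for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the bytes in a byte string.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

# What Perl holds for "fü" WITH use utf8: two characters, U+0066 and U+00FC.
my $chars = "f\x{FC}";

# What Perl holds for "fü" WITHOUT use utf8: the three raw source bytes.
my $raw = "f\xC3\xBC";

print hex_bytes(encode('ISO-8859-1', $chars)), "\n";  # 66 fc          (use utf8, no output layer)
print hex_bytes($raw), "\n";                          # 66 c3 bc       (no use utf8, no output layer)
print hex_bytes(encode('UTF-8', $chars)), "\n";       # 66 c3 bc       (use utf8 plus :encoding(utf8))
print hex_bytes(encode('UTF-8', $raw)), "\n";         # 66 c3 83 c2 bc (no use utf8 plus :encoding(utf8))
```

The last line is the double encoding from the question: each of the bytes c3 and bc is treated as a character in its own right and encoded again.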

Sinan Ünür
To be safe I have to turn on utf8 both in the code and in every filehandle I might write to? Any way to make that magically happen?
Schwern
`use open ":encoding(utf8)"; use open ":std";` appears to be the necessary magic.
Schwern
As Sinan wrote: You have to turn on utf8 to tell Perl that your source code is using UTF-8. You have to encode text written to your write filehandles and decode text read from your read filehandles if your files contain UTF-8.
Nele Kosog
If you're running on other people's systems, you probably want `use open ':locale';` so you'll respect their settings. They may not want UTF-8. (Note that `:locale` implies `:std`.)
cjm
A: 

use utf8; simply indicates that your source code (including string literals) is in UTF-8. You also need to set the encoding of your input & output streams.

You probably want to set the PERL_UNICODE variable in your environment. I set it to SAL, which breaks down like this:

  • S STDIN/STDOUT/STDERR are UTF-8
  • A @ARGV is UTF-8
  • L but only in a UTF-8 locale

See PERL_UNICODE and the -C option in perlrun.
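To check what a given setting actually switched on, a script can inspect the ${^UNICODE} variable, which holds the -C / PERL_UNICODE bitmask. The sketch below is only an illustration: the bit values are the ones listed in perlrun, the S flag is shown as its I, O and E components, and unicode_flags is a hypothetical helper name.

```perl
use strict;
use warnings;

# Bit values documented for -C / PERL_UNICODE in perlrun.
# S is shorthand for I + O + E (STDIN, STDOUT, STDERR).
my %FLAG = (I => 1, O => 2, E => 4, A => 32, L => 64);

# Hypothetical helper: letters whose bits are set in a -C style bitmask.
sub unicode_flags {
    my ($mask) = @_;
    return join '', grep { $mask & $FLAG{$_} } sort keys %FLAG;
}

# ${^UNICODE} reflects the flags Perl was started with (0 if none were set).
print "Active -C flags: ", unicode_flags(${^UNICODE}), "\n";
```

Running it as `PERL_UNICODE=SAL perl script.pl` should report all five letters, since SAL corresponds to the bitmask 7 + 32 + 64.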

You can also use the open pragma to set a default encoding.
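A minimal sketch of the open pragma approach, assuming the script itself is saved as UTF-8. It writes through a handle opened with no explicit layer and then reads the file back with :raw to show the bytes on disk; bytes_written is a hypothetical helper for the demonstration only.

```perl
use strict;
use warnings;
use utf8;                             # this source file is saved as UTF-8
use open qw(:std :encoding(UTF-8));   # default layer for open(), plus STDIN/STDOUT/STDERR
use File::Temp qw(tempfile);

# Hypothetical helper: write $text with the default layer, return the bytes on disk as hex.
sub bytes_written {
    my ($text) = @_;
    my ($tmp, $path) = tempfile(UNLINK => 1);
    close $tmp;

    # No explicit layer here, so the default set by "use open" applies.
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    # An explicit :raw layer overrides the default, exposing the stored bytes.
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/;
    my $bytes = <$in>;
    close $in;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

print bytes_written("f\x{FC}"), "\n";  # 66 c3 bc
print "fü\n";                          # STDOUT is encoded as UTF-8 via :std
```

The pragma is lexically scoped, so it only affects open() calls compiled within the file or block that declares it.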

If you're doing this in a module that you're distributing to others, you probably want

use open ':locale';

so it won't unexpectedly turn on UTF-8 for people who don't use a UTF-8 locale.

cjm
Thanks. I'd use that, but I have to do this as part of a module, and PERL_UNICODE appears to take effect only at startup time.
Schwern