ansaurus

Question

How do I read UTF-8 with diamond operator (<>)?

Answer 1

+19 A:

Try to use the pragma open instead:

use strict;
use warnings;
use open qw(:std :utf8);

while(<>){
    my @chars = split //, $_;
    print "$_" foreach(@chars);
}

You need to do this because the <> operator is magical. As you know, it reads either from STDIN or from the files named in @ARGV. Reading from STDIN causes no problem: STDIN is already open, so binmode works on it. The problem arises when reading from the files in @ARGV: when your script starts and calls binmode, those files are not open yet. The call sets STDIN to UTF-8, but that handle is not used when @ARGV contains files. Instead, the <> operator opens a new file handle for each file in @ARGV, and each newly opened handle starts without the UTF-8 layer. With the open pragma, every file handle opened in its scope defaults to UTF-8, and :std applies the same layer to STDIN, STDOUT and STDERR.
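A minimal sketch of the effect described above, written as a self-contained check. The temporary file, its contents and the @lengths array are assumptions made for illustration; assigning to @ARGV stands in for passing the file on the command line.

```perl
use strict;
use warnings;
use open qw(:std :utf8);   # default layers for handles opened in this scope, plus the standard handles
use File::Temp qw(tempfile);

# Hypothetical sample file: the raw UTF-8 bytes of "é" (0xC3 0xA9) and a newline.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':raw';
print {$fh} "\xC3\xA9\n";
close $fh;

# Pretend the file was named on the command line.
@ARGV = ($path);

my @lengths;
while (my $line = <>) {
    chomp $line;
    # Under the pragma, the two bytes should arrive as a single character.
    push @lengths, length $line;
}
print "characters per line: @lengths\n";
```

Without the use open line, the same loop would be expected to report two characters per line, because <> would hand back undecoded bytes.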

potyl 2009-02-06 06:52:09

Answer 2

+8 A:

Your script works if you do this:

#!/usr/bin/perl -w

binmode STDOUT, ':utf8';

while(<>){
    binmode ARGV, ':utf8';

    my @chars = split //, $_;
    print "$_\n" foreach(@chars);
}

The magic filehandle that <> reads from is called *ARGV, and it is opened when you call readline.

But really, I am a fan of explicitly using Encode::decode and Encode::encode when appropriate.
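For comparison, here is a rough sketch of what the explicit Encode approach might look like. The helper name utf8_chars is hypothetical, and the @ARGV guard is an addition so that the sketch does not wait on STDIN when run without file arguments; drop it to get the usual <> behaviour.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes one line of raw UTF-8 bytes and returns its
# characters, each re-encoded as UTF-8 bytes ready for printing.
sub utf8_chars {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> Perl characters
    chomp $text;
    return map { encode('UTF-8', $_) } split //, $text;   # characters -> bytes
}

# No I/O layers are set here, so <> returns bytes and print writes bytes.
if (@ARGV) {
    while (my $line = <>) {
        print "$_\n" for utf8_chars($line);
    }
}
```

The design choice is that decoding and encoding happen at visible points in the code, so the result does not depend on which layers a file handle happens to carry.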

jrockway 2009-02-06 08:33:17

Do you have to have the binmode in the while because ARGV is reset for multiple files?

brian d foy 2009-02-09 02:21:15

experimentally, yes :)

jrockway 2009-02-09 05:13:09

Answer 3

+5 A:

You can switch on UTF-8 by default with the -C flag:

perl -CSD -ne 'print join("\n",split //);' utf8.txt

The switch -CSD turns on UTF-8 unconditionally; if you use just -C, it turns on UTF-8 only if the relevant environment variables (LC_ALL, LC_CTYPE and LANG) indicate so. See perlrun for details.

This is not recommended if you don't invoke perl directly (in particular, it might not work reliably if you pass options to perl from the shebang line). See the other answers in that case.

Bruno De Fraine 2009-02-06 08:50:27

There has been an issue with the -C switch since Perl 5.10: http://www.fi.muni.cz/~kas/blog/index.cgi/computers/too-late-for-cs-howto.html

Hynek -Pichi- Vychodil 2009-02-06 09:05:01

Off topic: '#!/usr/bin/perl' is not a recommended shebang line; see perlrun for details. If you don't want the perlrun approach, use #!/usr/bin/env perl, which is more portable than #!/usr/bin/perl.

Hynek -Pichi- Vychodil 2009-02-06 09:09:51

Thanks, I made it clear you should only use this when you invoke perl directly.

Bruno De Fraine 2009-02-06 13:05:18
