How do I treat the elements of @ARGV as UTF-8 in Perl?

Currently I'm using the following workaround:

    use Encode qw(decode encode);

    my $foo = $ARGV[0];
    $foo = decode("utf-8", $foo);

This works, but it is not very elegant.

I'm using Perl v5.8.8, which is called from bash v3.2.25 with LANG set to en_US.UTF-8.

+1  A: 

You shouldn't have to do anything special to the string. Perl strings are in UTF-8 by default starting with Perl 5.8.

    perl -CO -le 'print "\x{2603}"' | xargs perl -le 'print "I saw @ARGV"'

The code above works just fine on Ubuntu 9.04, OS X 10.6, and FreeBSD 7.

FalseVinylShrub brings up a good point. We can see a definite difference between

    perl -Mutf8 -wle ';print utf8::is_utf8($ARGV[0]) ? "t" : "f"' a

and

    perl -Mutf8 -CA -wle ';print utf8::is_utf8($ARGV[0]) ? "t" : "f"' a

Chas. Owens
The command-line arguments don't start life as Perl strings, though. It's an external data source like anything else.
brian d foy
But if his or her shell is set to UTF-8, then anything he or she types will be in UTF-8.
Chas. Owens
That doesn't mean a co-worker's shell will be set to that :)
brian d foy
I find it easier to specify the working environment than to try to cover all possible environments. Now, if this is meant to be distributed to other people, that changes things, but the question included the fact that the terminal will be set to UTF-8. Similarly, most of the time I don't mess with `File::Spec`, even though my code won't work on certain systems.
Chas. Owens
+1  A: 

The way you've done it seems correct. That's what I would do.

However, the perlrun documentation at http://perldoc.perl.org/perlrun.html#*-C-[number/list]* suggests that the command-line flag -CA tells Perl to treat @ARGV as UTF-8 (not tested).
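
To make the effect concrete, here is a minimal sketch of what treating an argument as UTF-8 changes, done by hand with Encode rather than with -CA. The byte string stands in for an argument typed in a UTF-8 terminal, and `describe` is a hypothetical helper name used only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: report the length Perl sees and whether the
# string is flagged internally as holding characters.
sub describe {
    my ($s) = @_;
    return sprintf "length=%d is_utf8=%s",
        length($s), utf8::is_utf8($s) ? "t" : "f";
}

# The three bytes that encode U+2603 (SNOWMAN) in UTF-8, as they
# would arrive in @ARGV when nothing has decoded them.
my $raw = "\xe2\x98\x83";

# Decoding turns the three bytes into one character.
my $decoded = decode("UTF-8", $raw);

print describe($raw), "\n";      # length=3 is_utf8=f
print describe($decoded), "\n";  # length=1 is_utf8=t
```

The undecoded argument is three separate bytes to Perl, so `length`, `substr`, and regex character classes all operate on bytes until something decodes it.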

FalseVinylShrub
-CA expects the command-line arguments to be encoded as UTF-8. That doesn't mean that they are. :)
brian d foy
Thanks for the info, so you're saying this way assumes UTF-8 encoding, but your way goes and finds out the encoding...?
FalseVinylShrub
I've found that it's never safe to assume any encoding. Too many people get it to work on their machine then find out it breaks for someone else who has a different setup.
brian d foy
+9  A: 

Outside data sources are tricky in Perl. For command-line arguments, you're probably getting them in the encoding specified by your locale. Don't rely on your locale being the same as that of someone else who might run your program.

You have to find out what that encoding is, then convert to Perl's internal format. Fortunately, it's not that hard.

The I18N::Langinfo module has the stuff you need to get the encoding:

    use I18N::Langinfo qw(langinfo CODESET);
    my $codeset = langinfo(CODESET);

Once you know the encoding, you can decode them to Perl strings:

    use Encode qw(decode);
    @ARGV = map { decode $codeset, $_ } @ARGV;

Although Perl encodes internal strings as UTF-8, you shouldn't ever need to think about or rely on that. You just decode whatever you get, which turns it into Perl's internal representation for you. Trust that Perl will handle everything else. When you need to store the data, encode it explicitly in the encoding you want.
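
Putting the two snippets together, a complete script might look like the sketch below. It assumes the locale's codeset is one that Encode recognizes; `decode_args` is a hypothetical helper name, and the final loop is only there to show the round trip back to bytes on output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Encode qw(decode encode);
use I18N::Langinfo qw(langinfo CODESET);

# Hypothetical helper: decode a list of byte strings from the given
# encoding into Perl character strings.
sub decode_args {
    my ($codeset, @bytes) = @_;
    return map { decode($codeset, $_) } @bytes;
}

# Ask the locale which encoding the outside world is using.
my $codeset = langinfo(CODESET);

# Decode once, at the boundary, and work with characters afterwards.
@ARGV = decode_args($codeset, @ARGV);

# Encode again on the way out so the terminal receives bytes
# in the encoding it expects.
for my $arg (@ARGV) {
    printf "%s has %d character(s)\n", encode($codeset, $arg), length $arg;
}
```

Keeping the decoding in one place at the start of the program means the rest of the code never has to know which encoding the arguments arrived in.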

brian d foy