The perldoc page for length() tells me that I should use bytes::length(EXPR) to find the length of a Unicode string in bytes, and the bytes page echoes this.

use bytes;
$ascii = 'Lorem ipsum dolor sit amet';
$unicode = 'Lørëm ípsüm dölör sît åmét';

print "ASCII: " . length($ascii) . "\n";
print "ASCII bytes: " . bytes::length($ascii) . "\n";
print "Unicode: " . length($unicode) . "\n";
print "Unicode bytes: " . bytes::length($unicode) . "\n";

The output of this script, however, disagrees with the manpage:

ASCII: 26
ASCII bytes: 26
Unicode: 35
Unicode bytes: 35

It seems to me that length() and bytes::length() return the same value for both ASCII and Unicode strings. I have my editor set to write files as UTF-8 by default, so I figure Perl is interpreting the whole script as Unicode. Does that mean length() automatically handles Unicode strings properly?

Edit: See my comment; my question doesn't make a whole lot of sense, because length() is not working "properly" in the above example: it is showing the length of the Unicode string in bytes, not characters. The reason I originally stumbled across this is a program in which I need to set the Content-Length header (in bytes) in an HTTP message. I had read up on Unicode in Perl and was expecting to have to do some fanciness to make things work, but when length() returned exactly what I needed right off the bat, I was confused! See the accepted answer for an overview of use utf8, use bytes, and no bytes in Perl.
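
For the Content-Length case, one way to get a byte count that does not depend on how Perl happens to store the string is to encode the text explicitly and measure the result. This is only a minimal sketch: it assumes the body is sent as UTF-8, and content_length_for is a hypothetical helper name, not something from the original program.

```perl
use strict;
use warnings;
use utf8;                  # the string literal below is UTF-8 text
use Encode qw(encode);

# Hypothetical helper: encode a character string to UTF-8 octets and
# return how many octets the body would occupy on the wire.
sub content_length_for {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);
    return length $octets; # $octets is a byte string, so this counts bytes
}

my $body = 'Lørëm ípsüm dölör sît åmét';
printf "Characters: %d\n",     length $body;               # 26
printf "Content-Length: %d\n", content_length_for($body);  # 35
```

The encoded octets are also what should be written to the socket, so that the header and the body agree.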

+10  A: 

If your scripts are encoded in UTF-8, then please use the utf8 pragma. The bytes pragma on the other hand will force byte semantics on length, even if the string is UTF-8. Both work in the current lexical scope.

$ascii = 'Lorem ipsum dolor sit amet';
{
    use utf8;
    $unicode = 'Lørëm ípsüm dölör sît åmét';
}
$not_unicode = 'Lørëm ípsüm dölör sît åmét';

no bytes; # default, can be omitted
print "Character semantics:\n";

print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";

print "----\n";

use bytes;
print "Byte semantics:\n";

print "ASCII: ", length($ascii), "\n";
print "Unicode: ", length($unicode), "\n";
print "Not-Unicode: ", length($not_unicode), "\n";

This outputs:

Character semantics:
ASCII: 26
Unicode: 26
Not-Unicode: 35
----
Byte semantics:
ASCII: 26
Unicode: 35
Not-Unicode: 35
Inshallah
+2  A: 

The purpose of the bytes pragma is to replace the length function (and several other string related functions) in the current scope. So every call to length in your program is a call to the length that bytes provides. This is more in line with what you were trying to do:

#!/usr/bin/perl

use strict;
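use warnings;

# Encode wide characters on output; avoids "Wide character in print" warnings
binmode STDOUT, ':encoding(UTF-8)';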

sub bytes($) {
    use bytes;
    return length shift;
}

my $ascii = "foo"; #really UTF-8, but everything is in the ASCII range
my $utf8  = "\x{24d5}\x{24de}\x{24de}";

print "[$ascii] characters: ", length $ascii, "\n",
    "[$ascii] bytes     : ", bytes $ascii, "\n",
    "[$utf8] characters: ", length $utf8, "\n",
    "[$utf8] bytes     : ", bytes $utf8, "\n";

Another subtle flaw in your reasoning is the assumption that there is such a thing as Unicode bytes. Unicode is an enumeration of characters. It says, for instance, that U+24D5 is ⓕ (CIRCLED LATIN SMALL LETTER F). What Unicode does not specify is how many bytes a character takes up. That is left to the encodings: UTF-8 says this character takes up 3 bytes, UTF-16 says 2 bytes, UTF-32 says 4 bytes, and so on. Here is a comparison of Unicode encodings. Perl uses UTF-8 internally for strings that hold characters outside the 8-bit range. UTF-8 has the benefit of being identical to ASCII for the first 128 code points (U+0000 through U+007F).
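
The per-encoding sizes can be checked with the core Encode module. The following is a rough sketch rather than anything definitive; encoded_size is a hypothetical helper name, and the big-endian variants are an assumption made so that no byte-order mark is included in the count.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{24d5}";   # CIRCLED LATIN SMALL LETTER F, a single character

# Hypothetical helper: number of bytes a string occupies in a given encoding.
sub encoded_size {
    my ($encoding, $text) = @_;
    return length encode($encoding, $text);
}

# The "BE" variants are used so that no byte-order mark is added.
for my $encoding (qw(UTF-8 UTF-16BE UTF-32BE)) {
    printf "%-8s %d bytes\n", $encoding, encoded_size($encoding, $char);
}
```

For this character the loop reports 3, 2 and 4 bytes respectively, while length($char) is 1 in every case.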

Chas. Owens