Okay, this should be really simple, but I have searched all over for the answer and also read the following thread: http://stackoverflow.com/questions/1326539/finding-the-length-of-a-unicode-string-in-perl

It does not help me. I know how to get Perl to treat a string constant as UTF-8 and return the right number of chars (instead of bytes) but somehow it doesn't work when Perl receives the string via my AJAX call.
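
For example, something like this gives the expected 3 for a literal; just a sketch of what I mean by "works for a string constant":

#!/usr/bin/perl
use strict;
use utf8;                        # literals in this source file are treated as UTF-8

my $constant = "αβΩ";            # alpha, beta, omega
print length ($constant), "\n";  # prints 3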

Below, I am posting the three Greek letters Alpha, Beta and Omega in Unicode. Perl tells me the length is 6 (bytes) when it should be 3 (characters). How do I get the correct character count?

#!/usr/bin/perl
use strict;

if ($ENV{CONTENT_LENGTH}) {
    binmode (STDIN, ":utf8");
    read (STDIN, $_, $ENV{CONTENT_LENGTH});
    s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
    print "Content-Type: text/html; charset=UTF-8\n\nReceived: $_ (".length ($_)." chars)";
    exit;
}

print "Content-Type: text/html; charset=UTF-8\n\n";
print qq[<html><head><script>
        var oRequest;
        function MakeRequest () {
            oRequest = new XMLHttpRequest();
            oRequest.onreadystatechange = zxResponse;
            oRequest.open ('POST', '/test/unicode.cgi', true);
            oRequest.send (encodeURIComponent (document.oForm.oInput.value));
        }
        function zxResponse () {
            if (oRequest.readyState==4 && oRequest.status==200) {
                alert (oRequest.responseText);
            }
        }
    </script></head><body>
        <form name="oForm" method="POST">
            <input type="text" name="oInput" value="&#x03B1;&#x03B2;&#x03A9;">
            <input type="button" value="Ajax Submit" onClick="MakeRequest();">
        </form>
    </body></html>
];

By the way, the code is intentionally simplified (I know how to make a cross-browser AJAX call, etc.), and using the CGI Perl module is not an option.

+8  A: 

You need to decode this string before calling length. For example:

use Encode;

my $utf_string = decode_utf8($_); # decode the UTF-8 octets into characters
print length($utf_string);

From the Encode documentation:

$string = decode_utf8($octets [, CHECK]);

equivalent to $string = decode("utf8", $octets [, CHECK]) . The sequence of octets represented by $octets is decoded from UTF-8 into a sequence of logical characters. Not all sequences of octets form valid UTF-8 encodings, so it is possible for this call to fail. For CHECK, see Handling Malformed Data.
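
If the possibility of failure is a concern, you can pass a CHECK value and trap malformed input. A minimal sketch, with $_ standing for the raw bytes the question's script reads from STDIN:

use Encode;

my $octets = $_;    # raw bytes from the POST body
my $chars  = eval { decode ('UTF-8', $octets, Encode::FB_CROAK) };
if (!defined $chars) {
    # decode() died: the body was not valid UTF-8
    die "request body is not valid UTF-8: $@";
}
print length ($chars);    # characters, not bytes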

Ivan Nevostruev
Thanks for your quick and clever response. I may be too much of a purist, but I would prefer a non-module solution. Memory usage is an important factor here, and my intuition (tell me if it is wrong) is that it is better to have a few lines of inline code for a simple issue like this. The fail possibility also makes me a bit wary. If a “standalone” solution cannot be found, I will be happy to accept your answer.
W3Coder
@W3Coder: `Encode` is a core module and certainly should be considered a preferred way to accomplish this. This is a good solution in other words.
drewk
Asking for a non-module solution for a Unicode problem is like asking how to breathe without using your lungs.
brian d foy
I was just assuming that this solution would work, but it turns out it doesn't. Maybe I am doing something wrong: I added "use Encode;" after "use strict;" and "length (decode_utf8 ($_))" instead of "length ($_)". Perl still reports 6 chars! +1 for effort and conventional Perl wisdom :)
W3Coder
+4  A: 

For a "native" way to accomplish this, you can convert as you copy with this method:

Set the mode on an in memory file to the mode desired and read from that. This will make the conversion as the characters are read.

use strict;
use warnings;

my $utf_str = "αβΩ"; # alpha, beta, omega

print "$utf_str is ", length $utf_str, " characters\n";

use open ':encoding(utf8)';
open my $fh, '<', \$utf_str;

my $new_str;

{ local $/; $new_str=<$fh>; }

binmode(STDOUT, ":utf8");
print "$new_str ", length $new_str, " characters"; 

#output:
αβΩ is 6 characters
αβΩ 3 characters

If you want to convert the encoding in place, you can use this:

my $utf_str = "αβΩ";
print "$utf_str is ", length $utf_str, " characters\n";
binmode(STDOUT, ":utf8");
utf8::decode($utf_str);
print "$utf_str is ", length $utf_str, " characters\n";

#output:
αβΩ is 6 characters
αβΩ is 3 characters

You should not shy away from Encode, however.
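
For instance, here is a rough side-by-side of the two approaches; the point is that Encode hands back a decoded copy while utf8::decode converts the variable in place:

use strict;
use warnings;
use Encode;

my $octets = "αβΩ";                  # 6 octets, since `use utf8` is not in effect

my $copy = decode_utf8 ($octets);    # Encode: decoded copy, original left untouched
print length ($octets), " ", length ($copy), "\n";   # 6 3

utf8::decode ($octets);              # utf8::decode: converts $octets itself
print length ($octets), "\n";                        # 3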

drewk
Huh, that's an interesting way to handle that.
brian d foy
Very interesting indeed, thanks. So, for a web site with thousands of (small chat) postings every minute, should I go for Ivan's Encode solution, Drewk's in-memory file idea, or utf8::decode? I *think* posts will always be in UTF-8 as they are encoded with JavaScript's encodeURIComponent function.
W3Coder
As I stated in the post and the comment to Ivan's post, there is absolutely nothing wrong with `Encode` or the two solutions I gave you here. Probably the most efficient is the `utf8::decode` in the second method. It is fast, native, and it converts in place. If you want to keep the original string and have a local copy use `Encode`.
drewk
@brian d foy: Kinda backwards from our previous discussion. Now it's the filehandle that is making the copy. :)
drewk
I am going with the utf8::decode solution; I have tested it and it works, and, as stated, the text input will always be UTF-8. Thanks!
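In case it helps anyone else, this is roughly where the call ended up in my handler (a sketch from memory; the decode has to come after the %XX unescaping, and I dropped the :utf8 layer on STDIN since the body arrives as percent-encoded ASCII):

read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;   # undo encodeURIComponent
utf8::decode ($_);                                # UTF-8 bytes -> characters
binmode (STDOUT, ":utf8");                        # so the decoded text prints cleanly
print "Content-Type: text/html; charset=UTF-8\n\nReceived: $_ (".length ($_)." chars)";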
W3Coder
@W3Coder: Thanks for the positive feedback! It is nice to know that it is helpful.
drewk
+1  A: 

Use utf8::decode if you know the string is in UTF-8. It's core and there is no memory-usage penalty:

Basic do-nothing loop memory usage:

$ perl -e 'sleep 1 while 1' &
[1] 17372
$ ps u | grep 17372 | grep -v grep
okram    17372  0.0  0.1   5464  1172 pts/0    S    01:24   0:00 perl -e [...]

Memory usage with Encode:

$ perl -MEncode -e 'sleep 1 while 1' &
[1] 17488
$ ps u | grep 17488 | grep -v grep
okram    17488  0.7  0.2   6020  2224 pts/0    S    01:27   0:00 perl [...]

The proposed way:

$ perl -e '$str="ææææ";utf8::decode $str;print length $str,"\n\n";
sleep 1 while 1' &
[1] 17554
$ 4
$ ps u | grep 17554| grep -v grep
okram    17554  0.0  0.1   5464  1176 pts/0    S    01:28   0:00 perl -e [...]

As you can see, the length of the string after utf8::decode is 4 for that utf8 string, and the memory usage is pretty much the same as the baseline while(1). Encode seems to consume a bit more memory...
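
If the malformed-input case is a worry, utf8::decode also reports failure through its return value (at least on the perls I have tried), so a guard costs nothing extra:

my $str = $_;                         # the raw POST body, still octets here
if (utf8::decode ($str)) {
    print length ($str), " chars\n";  # character count
} else {
    # invalid UTF-8: handle it however makes sense for your app
    print "request body was not valid UTF-8\n";
}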

mfontani
The memory use is a red herring. In the context of everything else going on, it's most likely not going to matter. Note that the utf8 docs recommend *not* using `decode` and encourage people to use Encode.
brian d foy
Encode is preferred when one's dealing with arbitrary encodings, from what I understand: quoting, "Note that this function does not handle arbitrary encodings. Therefore Encode is recommended for the general purposes; see also Encode." -- since the requester wanted a non-module solution, this is as close as I could get ;)
mfontani
+1 for going through the trouble of testing, thanks
W3Coder