Okay, this should be really simple, but I have searched all over for the answer and also read the following thread: http://stackoverflow.com/questions/1326539/finding-the-length-of-a-unicode-string-in-perl

It does not help me. I know how to get Perl to treat a string constant as UTF-8 and return the right number of chars (instead of bytes) but somehow it doesn't work when Perl receives the string via my AJAX call.
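
For example, something like this gives the expected 3 for a literal; just a sketch of what I mean by "works for a string constant":

#!/usr/bin/perl
use strict;
use utf8;                        # literals in this source file are treated as UTF-8

my $constant = "αβΩ";            # alpha, beta, omega
print length ($constant), "\n";  # prints 3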

Below, I am posting the three Greek letters Alpha, Beta and Omega in Unicode. Perl tells me the length is 6 (bytes) when it should be 3 (characters). How do I get the correct character count?

#!/usr/bin/perl
use strict;

if ($ENV{CONTENT_LENGTH}) {
    binmode (STDIN, ":utf8");
    read (STDIN, $_, $ENV{CONTENT_LENGTH});
    s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
    print "Content-Type: text/html; charset=UTF-8\n\nReceived: $_ (".length ($_)." chars)";
    exit;
}

print "Content-Type: text/html; charset=UTF-8\n\n";
print qq[<html><head><script>
        var oRequest;
        function MakeRequest () {
            oRequest = new XMLHttpRequest();
            oRequest.onreadystatechange = zxResponse;
            oRequest.open ('POST', '/test/unicode.cgi', true);
            oRequest.send (encodeURIComponent (document.oForm.oInput.value));
        }
        function zxResponse () {
            if (oRequest.readyState==4 && oRequest.status==200) {
                alert (oRequest.responseText);
            }
        }
    </script></head><body>
        <form name="oForm" method="POST">
            <input type="text" name="oInput" value="&#x03B1;&#x03B2;&#x03A9;">
            <input type="button" value="Ajax Submit" onClick="MakeRequest();">
        </form>
    </body></html>
];

By the way, the code is intentionally simplified (I know how to make a cross-browser AJAX call, etc.), and using the CGI Perl module is not an option.

+8  A: 

You need to decode this string before calling length. For example:

use Encode;

my $utf_string = decode_utf8($_); # decode the UTF-8 octets into characters
print length($utf_string);

From the Encode documentation:

$string = decode_utf8($octets [, CHECK]);

equivalent to $string = decode("utf8", $octets [, CHECK]) . The sequence of octets represented by $octets is decoded from UTF-8 into a sequence of logical characters. Not all sequences of octets form valid UTF-8 encodings, so it is possible for this call to fail. For CHECK, see Handling Malformed Data.
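
If the possibility of failure is a concern, you can pass a CHECK value and trap malformed input. A minimal sketch, with $_ standing for the raw bytes the question's script reads from STDIN:

use Encode;

my $octets = $_;    # raw bytes from the POST body
my $chars  = eval { decode ('UTF-8', $octets, Encode::FB_CROAK) };
if (!defined $chars) {
    # decode() died: the body was not valid UTF-8
    die "request body is not valid UTF-8: $@";
}
print length ($chars);    # characters, not bytes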

Ivan Nevostruev
Thanks for your quick and clever response. I may be too much of a purist, but I would prefer a non-module solution. Memory usage is an important factor here, and my intuition (tell me if it is wrong) is that it is better to have a few lines of inline code for a simple issue like this. The fail possibility also makes me a bit wary. If a “standalone” solution cannot be found, I will be happy to accept your answer.
W3Coder
@W3Coder: `Encode` is a core module and certainly should be considered a preferred way to accomplish this. This is a good solution in other words.
drewk
Asking for a non-module solution for a Unicode problem is like asking how to breathe without using your lungs.
brian d foy
I was just assuming that this solution would work, but it turns out it doesn't. Maybe I am doing something wrong: I added "use Encode;" after "use strict;" and "length (decode_utf8 ($_))" instead of "length ($_)". Perl still reports 6 chars! +1 for effort and conventional Perl wisdom :)
W3Coder
+4  A: 

For a "native" way to accomplish this, you can convert as you copy with this method:

Set the mode on an in memory file to the mode desired and read from that. This will make the conversion as the characters are read.

use strict;
use warnings;

my $utf_str = "αβΩ"; # alpha, beta, omega

print "$utf_str is ", length $utf_str, " characters\n";

use open ':encoding(utf8)';
open my $fh, '<', \$utf_str;

my $new_str;

{ local $/; $new_str=<$fh>; }

binmode(STDOUT, ":utf8");
print "$new_str ", length $new_str, " characters"; 

#output:
αβΩ is 6 characters
αβΩ 3 characters

If you want to convert the encoding in place, you can use this:

my $utf_str = "αβΩ";
print "$utf_str is ", length $utf_str, " characters\n";
binmode(STDOUT, ":utf8");
utf8::decode($utf_str);
print "$utf_str is ", length $utf_str, " characters\n";

#output:
αβΩ is 6 characters
αβΩ is 3 characters

You should not shy away from Encode, however.
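
For instance, here is a rough side-by-side of the two approaches; the point is that Encode hands back a decoded copy while utf8::decode converts the variable in place:

use strict;
use warnings;
use Encode;

my $octets = "αβΩ";                  # 6 octets, since `use utf8` is not in effect

my $copy = decode_utf8 ($octets);    # Encode: decoded copy, original left untouched
print length ($octets), " ", length ($copy), "\n";   # 6 3

utf8::decode ($octets);              # utf8::decode: converts $octets itself
print length ($octets), "\n";                        # 3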

drewk
Huh, that's an interesting way to handle that.
brian d foy
Very interesting indeed, thanks. So, for a web site with thousands of (small chat) postings every minute, should I go for Ivan's Encode solution, Drewk's in-memory file idea, or utf8::decode? I *think* posts will always be in UTF-8 as they are encoded with JavaScript's encodeURIComponent function.
W3Coder
As I stated in the post and the comment to Ivan's post, there is absolutely nothing wrong with `Encode` or the two solutions I gave you here. Probably the most efficient is the `utf8::decode` in the second method. It is fast, native, and it converts in place. If you want to keep the original string and have a local copy use `Encode`.
drewk
@brian d foy: Kinda backwards from our previous discussion. Now it's the filehandle that is making the copy. :)
drewk
I am going with the utf8::decode solution; I have tested it and it works, and, as stated, the text input will always be UTF-8. Thanks!
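In case it helps anyone else, this is roughly where the call ended up in my handler (a sketch from memory; the decode has to come after the %XX unescaping, and I dropped the :utf8 layer on STDIN since the body arrives as percent-encoded ASCII):

read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;   # undo encodeURIComponent
utf8::decode ($_);                                # UTF-8 bytes -> characters
binmode (STDOUT, ":utf8");                        # so the decoded text prints cleanly
print "Content-Type: text/html; charset=UTF-8\n\nReceived: $_ (".length ($_)." chars)";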
W3Coder
@W3Coder: Thanks for the positive feedback! It is nice to know that it is helpful.
drewk
+1  A: 

Use utf8::decode if you know the string is in UTF-8. It's core and there is no memory-usage penalty:

Basic do-nothing loop memory usage:

$ perl -e 'sleep 1 while 1' &
[1] 17372
$ ps u | grep 17372 | grep -v grep
okram    17372  0.0  0.1   5464  1172 pts/0    S    01:24   0:00 perl -e [...]

Memory usage with Encode:

$ perl -MEncode -e 'sleep 1 while 1' &
[1] 17488
$ ps u | grep 17488 | grep -v grep
okram    17488  0.7  0.2   6020  2224 pts/0    S    01:27   0:00 perl [...]

The proposed way:

$ perl -e '$str="ææææ";utf8::decode $str;print length $str,"\n\n";
sleep 1 while 1' &
[1] 17554
$ 4
$ ps u | grep 17554| grep -v grep
okram    17554  0.0  0.1   5464  1176 pts/0    S    01:28   0:00 perl -e [...]

As you can see, the length of the string after utf8::decode is 4 for that utf8 string, and the memory usage is pretty much the same as the baseline while(1). Encode seems to consume a bit more memory...
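
If the malformed-input case is a worry, utf8::decode also reports failure through its return value (at least on the perls I have tried), so a guard costs nothing extra:

my $str = $_;                         # the raw POST body, still octets here
if (utf8::decode ($str)) {
    print length ($str), " chars\n";  # character count
} else {
    # invalid UTF-8: handle it however makes sense for your app
    print "request body was not valid UTF-8\n";
}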

mfontani
The memory use is a red herring. In the context of everything else going on, it's most likely not going to matter. Note that the utf8 docs recommend *not* using `decode` and encourage people to use Encode.
brian d foy
Encode is preferred when one's dealing with arbitrary encodings, from what I understand: quoting, "Note that this function does not handle arbitrary encodings. Therefore Encode is recommended for the general purposes; see also Encode." -- since the requester wanted a non-module solution, this is as close as I could get ;)
mfontani
+1 for going through the trouble of testing, thanks
W3Coder