ansaurus

Question

Convert UTF8 string into numeric values in Perl

Answer 1

+3 A:

foreach my $c (split(//, $str))
{
    print ord($c), "\n";
}

Or compressed into a single line:

my @chars = map { ord } split //, $str;

Data::Dumpered, this produces:

Ether 2010-08-22 17:35:09

Answer 2

+3 A:

To have utf8 in your source code recognized as such, you must use utf8; beforehand:

$ perl
use utf8;
my $str = '中國c'; # the two Chinese characters for "China", followed by ASCII "c"
foreach my $c (split(//, $str))
{
    print ord($c), "\n";
}
__END__
20013
22283
99

or more tersely,

print join ',', map ord, split //, $str;
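
Note that use utf8; only affects string literals in the source file. If the string instead arrives as raw UTF-8 bytes (for example, read from a file or socket without an encoding layer), it has to be decoded first, otherwise ord sees individual bytes. A minimal sketch of that case, assuming the input is valid UTF-8; the helper name codepoints_from_bytes is made up for illustration and is not from the answers here:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: takes raw UTF-8 bytes, returns the Unicode code points.
sub codepoints_from_bytes {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed input instead of
    # silently substituting U+FFFD.
    my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK);
    return map { ord } split //, $chars;
}

# UTF-8 encoding of the same two Chinese characters, followed by ASCII "c".
my $bytes = "\xE4\xB8\xAD\xE5\x9C\x8Bc";
print join(',', codepoints_from_bytes($bytes)), "\n"; # prints 20013,22283,99
```

Without the decode step, the same loop would report seven byte values (228, 184, 173, ...) rather than three code points.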

ysth 2010-08-22 18:20:33

Answer 3

+4 A:

unpack will be more efficient than split and ord, because it doesn't have to make a bunch of temporary 1-character strings:

use utf8;

my $str = '中國c'; # the two Chinese characters for "China", followed by ASCII "c"

my @codepoints = unpack 'U*', $str;

print join(',', @codepoints) . "\n"; # prints 20013,22283,99

A quick benchmark shows it's about 3 times faster than split+ord:

use utf8;
use Benchmark 'cmpthese';

my $str = '中國中國中國中國中國中國中國中國中國中國中國中國中國中國c';

cmpthese(0, {
  'unpack'     => sub { my @codepoints = unpack 'U*', $str; },
  'split-map'  => sub { my @codepoints = map { ord } split //, $str },
  'split-for'  => sub { my @cp; for my $c (split(//, $str)) { push @cp, ord($c) } },
  'split-for2' => sub { my $cp; for my $c (split(//, $str)) { $cp = ord($c) } },
});

Results:

               Rate  split-map  split-for split-for2     unpack
split-map   85423/s         --        -7%       -32%       -67%
split-for   91950/s         8%         --       -27%       -64%
split-for2 125550/s        47%        37%         --       -51%
unpack     256941/s       201%       179%       105%         --

The difference is less pronounced with a shorter string, but unpack is still more than twice as fast. (split-for2 is a bit faster than the other splits because it doesn't build a list of codepoints.)
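
Before swapping one approach for the other, it may be worth confirming that they return the same list for your data. A small sketch of such a check; the sub names are invented for this comparison, and the test string is written with \x{...} escapes so it does not depend on the source file's encoding:

```perl
use strict;
use warnings;

# Hypothetical wrappers around the two techniques discussed above.
sub codepoints_unpack { my ($s) = @_; return unpack 'U*', $s }
sub codepoints_split  { my ($s) = @_; return map { ord } split //, $s }

# Same text as the earlier examples: U+4E2D, U+570B, then ASCII "c".
my $str = "\x{4E2D}\x{570B}c";

my @from_unpack = codepoints_unpack($str);
my @from_split  = codepoints_split($str);

# Compare the space-joined lists; fine for lists of plain integers.
print "@from_unpack" eq "@from_split" ? "same\n" : "different\n"; # prints same
```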

cjm 2010-08-22 21:59:48
