Without looping over the entire range of Unicode characters, how can I get a list of the characters that have a given property? In particular, I want a list of all characters that are digits (i.e. those that match /\d/). I have looked at Unicode::UCD, which is useful for determining the properties of a given character, but it doesn't seem to offer a way to get a list of the characters that have a given property.

A: 

Which characters /\d/ matches depends entirely on your regexp implementation (although the standard 0-9 are guaranteed). In the case of Perl, the locale in use defines which characters are considered alphabetic and which are considered digits.

ewanm89
Perl transforms strings into utf8 before running them through the regex engine. The only thing the Perl locale affects is how a raw byte string is transformed into utf8. Once a string is in utf8, Perl will always use the same definition of IsDigit, independent of locale.
tetromino
+3  A: 

The list of Unicode characters for each class is generated from the Unicode spec when you compile Perl, and is typically stored in /usr/lib/perl-YOURPERLVERSION/unicore/lib/gc_sc/

For example, the list of Unicode character ranges that match IsDigit (a.k.a. \d) is stored in the file /usr/lib/perl-YOURPERLVERSION/unicore/lib/gc_sc/Digit.pl
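On Perls new enough to provide it, the same range data can be read through Unicode::UCD instead of a path on disk. The following is only a sketch under that assumption: `prop_invlist` is not mentioned elsewhere in this thread, and `Nd` (General_Category=Decimal_Number) is used here as the property that \d matches on Unicode strings.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invlist);

# prop_invlist returns an inversion list: elements at even indexes start a
# run of code points that have the property, elements at odd indexes start
# a run of code points that do not.
my @invlist = prop_invlist("Nd");

my @digits;
for (my $i = 0; $i < @invlist; $i += 2) {
    my $start = $invlist[$i];
    # If there is no closing element, the run extends to the last code point.
    my $end = $i + 1 < @invlist ? $invlist[$i + 1] - 1 : 0x10FFFF;
    push @digits, map { chr } $start .. $end;
}

printf "%d digit characters found\n", scalar @digits;
```

This still loops, but only over the ranges that actually contain digits, not over every code point.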

tetromino
Thank you, this is almost exactly what I was looking for. I will still have to loop over them to build a list, but at least that won't take forever and a day.
Chas. Owens
A: 

There is no way to do that without iterating through all the characters. (If you create a huge string with all of them and use a regexp, you still have to do the loop at least once to create the string.)

Mihai Nita
Happily, part of the Perl build process creates a set of files under `unicore` in one of the lib directories that already have a lot of the work done for you. I don't know whether they are official or not; I have a question in to the Perl 5 Porters list to find out if it is safe to use them.
Chas. Owens
A: 

Even better than unicore/lib/gc_sc/Digit.pl is unicore/To/Digit.pl. It is a direct mapping of Unicode digit characters (well, really their code points in hex) to their numeric values. This means instead of:

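use Unicode::Digits qw/digits_to_int/;

my @digits;
# Each line of Digit.pl is a range: start and end code points in hex.
for (split "\n", require "unicore/lib/gc_sc/Digit.pl") {
    my ($s, $e) = map hex, split;
    for (my $ord = $s; $ord <= $e; $ord++) {
        my $chr = chr $ord;
        # Group each character under its numeric value (0 through 9).
        push @{$digits[digits_to_int $chr]}, $chr;
    }
}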

for my $i (0 .. 9) {
    my $re = join '', "[", @{$digits[$i]}, "]";
    $digits[$i] = qr/$re/;
}

I can say:

my @digits;
for (split "\n", require "unicore/To/Digit.pl") {
    my ($ord, $val) = split;
    my $chr = chr hex $ord;
    push @{$digits[$val]}, $chr;
}

for my $i (0 .. 9) {
    my $re = join '', "[", @{$digits[$i]}, "]";
    $digits[$i] = qr/$re/;
}

Or even better:

my @digits;
for (split "\n", require "unicore/To/Digit.pl") {
    my ($ord, $val) = split;
    $digits[$val] .= "\\x{$ord}";
}
@digits = map { qr/[$_]/ } @digits;
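Because the `unicore` files may not be an official interface, here is a sketch of the same per-value character classes built only from documented Unicode::UCD functions. It assumes a Perl recent enough to provide both `prop_invlist` and `num`; the names `@class`, `@digit_re`, and `digit_value` are made up for this illustration.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invlist num);

# Inversion list of all decimal digits (General_Category=Nd).
my @invlist = prop_invlist("Nd");

# One character-class body per numeric value 0 through 9.
my @class = ('') x 10;
for (my $i = 0; $i < @invlist; $i += 2) {
    my $end = $i + 1 < @invlist ? $invlist[$i + 1] - 1 : 0x10FFFF;
    for my $cp ($invlist[$i] .. $end) {
        # num() gives the numeric value of the digit character.
        $class[ num(chr $cp) ] .= sprintf '\x{%X}', $cp;
    }
}
my @digit_re = map { qr/[$_]/ } @class;

# Hypothetical helper: numeric value of a single digit character,
# or undef if the character is not a decimal digit.
sub digit_value {
    my ($chr) = @_;
    for my $val (0 .. 9) {
        my $re = $digit_re[$val];
        return $val if $chr =~ /\A$re\z/;
    }
    return;
}

print digit_value("\x{0663}"), "\n";   # ARABIC-INDIC DIGIT THREE
```

The final line should print 3, since U+0663 is the Arabic-Indic digit three.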
Chas. Owens