ansaurus

Question

Performance of character classes vs "shorthand" classes

Answer 1

+1 A:

Well, not knowing the implementation (and not wanting to plow through hundreds of lines of code in Reflector), it's hard to know for sure, so let's try some wild speculation.

Postulate A) "[J-P]" is coded something like "JKLMNOP".Contains(chr)
Postulate B) Seaching for a digit can be done faster by Char.IsDigit(chr)

Given that, this boils down to two questions:

Is it likely Regex would treat \d as a special case and use IsDigit? - I think this is likely.
Is it likely Regex would recognize [0-9] as a special case distinct from [J-P] and use IsDigit instead of Contains? - I see this as possible, but unlikely.

So, I'd say there is a chance \d is faster than [0-9].

(Note: I'm using C#/.NET as examples, but the basic principles should be the same regardless of platform)

James Curran 2010-08-18 21:02:09

Answer 2

+1 A:

When in doubt, BENCHMARK!

I benchmarked a simple regex comparison in Perl. I found that \d+ is indeed faster. YMMV.

use strict; use warnings;
use Benchmark qw(:all);
use feature "switch";

my $r1='\d+';
my $r2='[0-9]+';
my $r3='[[:digit:]]+';

sub test {
    my @lines = <DATA>;
    $_=shift;
    my $RegEx=(caller(0))[3];
    given($_) {
        when(1) { $RegEx=$r1; }
        when(2) { $RegEx=$r2; }
        when(3) { $RegEx=$r3; }
        default { die "$RegEx can't deal with $_\n"; }
    }   
    my $ln;
    my $total;
    my @numbers;
    foreach my $line (@lines) {
        $total=0;
        @numbers=$line=~/($RegEx)/g;
        $total+=$_ for (@numbers) ;
        $ln=$numbers[$#numbers];
        $total-=$ln;
        if ($ln != $total) {
            print "Bad RegEx result: Last Num != total in line!\n";
            print "Total=$total, ln=$ln, line:$line";
            die;
        }       
    }   
}

cmpthese(-10, {$r1=>'test(1)', $r2=>'test(2)', $r3=>'test(3)'});


__DATA__
Clip clap clock 1 mouse ran up the clock with 3 hands. The total here is 4.
The mouse with 2 ears followed. The total here is 2.
After that, the 6 wiskered mouse did dances with 14 second timing. 20.
It is hard to make up 5 lines with 2 or 3 numbers in each line. 10.
You start thinking about nurserey rhymes with 1 o 2 or 3 number. 6.
1 12 13 123 23 13 55 66 21 45 1 373

I get the following results on OS X with Perl 5.10 64 bit:

                 Rate [[:digit:]]+       [0-9]+          \d+
[[:digit:]]+ 200781/s           --          -1%          -2%
[0-9]+       202831/s           1%           --          -1%
\d+          204605/s           2%           1%           --

And the following on Ubuntu 10.04 with Perl 5.10.1:

                 Rate [[:digit:]]+       [0-9]+          \d+
[[:digit:]]+ 264412/s           --          -3%          -6%
[0-9]+       273202/s           3%           --          -3%
\d+          280541/s           6%           3%           --

drewk 2010-08-19 03:03:00

It's interesting that [[:digit:]], which is conceptually the same as \d came out as the slowest. And that, there's a correlation between length of the pattern and the run time. That indicates that there's no "compiling" step, and the pattern is constantly being re-interpreted.

James Curran 2010-08-19 13:54:03

@James Curran: I did purposely write it so it was reinterpreted each time by holding the regex in a scalar without allowing `//o` or `qr//` to compile the regex. I rewrote the code a little to allow compiling the regex with qr//o to test. In the case, `[[:digit:]]+` and `[0-9]+` were the same and `\d+` was 3% faster.

drewk 2010-08-19 14:39:46

ansaurus

tags:

views:

answers:

Performance of character classes vs "shorthand" classes

related questions