tags:

views:

53

answers:

2

While answering another question, it was brought up that there might a difference, performance-wise, between an explicit character class ([0-9]) and a "shorthand" class (\d). My initial reaction was that if a difference exists at all, it'd be negligible, but I don't have (and couldn't find) any info about it or figure out how I could test this.

Similarly, would there be a non-negligible difference between [^0-9], [^\d] and \D?

+1  A: 

Well, not knowing the implementation (and not wanting to plow through hundreds of lines of code in Reflector), it's hard to know for sure, so let's try some wild speculation.

  • Postulate A) "[J-P]" is coded something like "JKLMNOP".Contains(chr)
  • Postulate B) Seaching for a digit can be done faster by Char.IsDigit(chr)

Given that, this boils down to two questions:

  • Is it likely Regex would treat \d as a special case and use IsDigit? - I think this is likely.
  • Is it likely Regex would recognize [0-9] as a special case distinct from [J-P] and use IsDigit instead of Contains? - I see this as possible, but unlikely.

So, I'd say there is a chance \d is faster than [0-9].

(Note: I'm using C#/.NET as examples, but the basic principles should be the same regardless of platform)

James Curran
+1  A: 

When in doubt, BENCHMARK!

I benchmarked a simple regex comparison in Perl. I found that \d+ is indeed faster. YMMV.

use strict; use warnings;
use Benchmark qw(:all);
use feature "switch";

my $r1='\d+';
my $r2='[0-9]+';
my $r3='[[:digit:]]+';

sub test {
    my @lines = <DATA>;
    $_=shift;
    my $RegEx=(caller(0))[3];
    given($_) {
        when(1) { $RegEx=$r1; }
        when(2) { $RegEx=$r2; }
        when(3) { $RegEx=$r3; }
        default { die "$RegEx can't deal with $_\n"; }
    }   
    my $ln;
    my $total;
    my @numbers;
    foreach my $line (@lines) {
        $total=0;
        @numbers=$line=~/($RegEx)/g;
        $total+=$_ for (@numbers) ;
        $ln=$numbers[$#numbers];
        $total-=$ln;
        if ($ln != $total) {
            print "Bad RegEx result: Last Num != total in line!\n";
            print "Total=$total, ln=$ln, line:$line";
            die;
        }       
    }   
}

cmpthese(-10, {$r1=>'test(1)', $r2=>'test(2)', $r3=>'test(3)'});


__DATA__
Clip clap clock 1 mouse ran up the clock with 3 hands. The total here is 4.
The mouse with 2 ears followed. The total here is 2.
After that, the 6 wiskered mouse did dances with 14 second timing. 20.
It is hard to make up 5 lines with 2 or 3 numbers in each line. 10.
You start thinking about nurserey rhymes with 1 o 2 or 3 number. 6.
1 12 13 123 23 13 55 66 21 45 1 373

I get the following results on OS X with Perl 5.10 64 bit:

                 Rate [[:digit:]]+       [0-9]+          \d+
[[:digit:]]+ 200781/s           --          -1%          -2%
[0-9]+       202831/s           1%           --          -1%
\d+          204605/s           2%           1%           --

And the following on Ubuntu 10.04 with Perl 5.10.1:

                 Rate [[:digit:]]+       [0-9]+          \d+
[[:digit:]]+ 264412/s           --          -3%          -6%
[0-9]+       273202/s           3%           --          -3%
\d+          280541/s           6%           3%           --
drewk
It's interesting that [[:digit:]], which is conceptually the same as \d came out as the slowest. And that, there's a correlation between length of the pattern and the run time. That indicates that there's no "compiling" step, and the pattern is constantly being re-interpreted.
James Curran
@James Curran: I did purposely write it so it was reinterpreted each time by holding the regex in a scalar without allowing `//o` or `qr//` to compile the regex. I rewrote the code a little to allow compiling the regex with qr//o to test. In the case, `[[:digit:]]+` and `[0-9]+` were the same and `\d+` was 3% faster.
drewk