views:

98

answers:

3

Whilst discussing the relative merits of using index() in Perl to search for substrings I decided to write a micro benchmark to prove what I had seen before than index is faster than regular expressions when looking for a substring. Here is the benchmarking code:

use strict;
use warnings;
use Benchmark qw(:all);

my @random_data;
for (1..100000) {
  push(@random_data, int(rand(1000)));
}

my $warn_about_counts = 0;
my $count = 100;
my $search = '99';

cmpthese($count, {
  'Using regex' => sub {
    my $instances = 0;
    my $regex = qr/$search/;
    foreach my $i (@random_data) {
      $instances++ if $i =~ $regex;
    }
    warn $instances if $warn_about_counts;
    return;
  },
  'Uncompiled regex with scalar' => sub {
    my $instances = 0;
    foreach my $i (@random_data) {
      $instances++ if $i =~ /$search/;
    }
    warn $instances if $warn_about_counts;
    return;
  },
  'Uncompiled regex with literal' => sub {
    my $instances = 0;
    foreach my $i (@random_data) {
      $instances++ if $i =~ /99/;
    }
    warn $instances if $warn_about_counts;
    return;
  },
  'Using index' => sub {
    my $instances = 0;
    foreach my $i (@random_data) {
      $instances++ if index($i, $search) > -1;
    }
    warn $instances if $warn_about_counts;
    return;
  },
});

What I was surprised at was how these performed (using Perl 5.10.0 on a recent MacBook Pro). In descending order of speed:

  1. Uncompiled regex with literal (69.0 ops/sec)
  2. Using index (61.0 ops/sec)
  3. Uncompiled regex with scalar (56.8 ops/sec)
  4. Using regex (17.0 ops/sec)

Can anyone offer an explanation as to what voodoo Perl is using to get the speed of the two uncomplied regular expressions to perform as well as the index operation? Is it an issue in the data I've used to generate the benchmark (looking for the occurrence of 99 in 100,000 random integers) or is Perl able to do a runtime optimisation?

+2  A: 

Well, your case "Using regex" is so slow because you are compiling it each time. Try moving it out of the subroutine.

Chas. Owens
But isn't the OP specifically trying to compare compilation speed? So pulling that out of the loop is not helpful here, it renders the benchmarking moot. :)
Ether
@Ether But the is no point in using a compiled regex that way. Compiled regexes only have two benefits: they are compiled once and they are properly wrapped in non-capturing groups that maintain the options you gave them even when interpolated into other regexes (e.g. `my $any_foo = qr/foo/i; my $any_foo_lower_bar = qr/${any_foo}bar;`)
Chas. Owens
Moving the regular expression outside of the subroutine makes no difference to the benchmark.
andeyatz
+2  A: 

Wholesale revision

In light of @Ven'Tatsu's comment, I changed the benchmark a bit:

use strict; use warnings;
use Benchmark qw(cmpthese);

use Data::Random qw( rand_words );
use Data::Random::WordList;

my $wl = Data::Random::WordList->new;

my @data_1 = (rand_words( size => 10000 )) x 10;
my @data_2 = @data_1;

my $pat = 'a(?=b)';
my $re = qr/^$pat/;

cmpthese(1, {
    'qr/$search/' => sub {
        my $instances = grep /$re/, @data_1;
        return;
    },
    'm/$search/' => sub {
        my $search = 'a(?=b)';
        my $instances = grep /^$search/, @data_2;
        return;
    },
});

On Windows XP with ActiveState perl 5.10.1:

              Rate qr/$search/  m/$search/
qr/$search/ 5.40/s          --        -73%
m/$search/  20.1/s        272%          --

On Windows XP with Strawberry perl 5.12.1:

              Rate qr/$search/  m/$search/
qr/$search/ 6.42/s          --        -66%
m/$search/  18.6/s        190%          --

On ArchLinux with bleadperl:

              Rate qr/$search/  m/$search/
qr/$search/ 9.25/s          --        -38%
m/$search/  14.8/s         60%          --
Sinan Ünür
Regexp compilation got some attention in 5.10.x, under 5.8.8 `qr/$search/` and `m/$search/` are within 1% of each other. An additional issue is the complexity of the pattern. Changing the pattern to `9{2}` pushes `qr/$search/` just slightly ahead for me. Any more complex expression (like `9(?=9).`) and it becomes the clear winner over `m/$search/`.
Ven'Tatsu
I'm curious if /o makes a difference at all, but not curious enough to actually try right now. :)
brian d foy
Thanks for pointing out the `Data::Random::WordList` module @Sinan which would give me more realistic results. As to timings the point was to build a benchmark which would compare using index to detect a substring to using a regular expression for the same purpose. Therefore changing the regex is really shifting the goal-posts
andeyatz
@andeyatz Check the previous versions of my post for timings of `qr` versus `index`. The peculiar thing to me is not any difference between `index` versus pattern matching. The counter-intuitive outcome of the first benchmarks was the fact that `qr/$var/` turned out to be much slower than `m/$var/` even though the compiled patters and the actual output of `re debug` seems identical to me.
Sinan Ünür
@Sinan sorry for not noting that earlier on. This is identical to what I'm seeing then. When I turned on `re debug` it seemed to say that all regular expressions were equal but the micro-benchmark doesn't lie. It seems that there is a significant penalty for using a `qr//` regex over an inline one. Could this be caused by using a thread compiled version of Perl?
andeyatz
+1  A: 

Perl optimizes a lot of things. Your pattern with no special regex features and literal characters allows perl's regex engine to simplify many things. Using use re 'debug' can show you what's actually happening behind the scenes.

brian d foy
I would expect matching against a precompiled `qr/$var/` to represent an upper bound on how fast `m/$var/` can be, but maybe my expectation is wrong.
Sinan Ünür
All regexs compile in an identical manner according to the `re` pragma
andeyatz