Which one of these is cheaper?

$_ = 'abc123def';

s/\d+/$&*2/e;
say;

s/(\d+)/$1*2/e;
say;
+10  A: 

From perldoc perlvar:

  • $MATCH
  • $&

The string matched by the last successful pattern match (not counting any matches hidden within a BLOCK or eval() enclosed by the current BLOCK). (Mnemonic: like & in some editors.) This variable is read-only and dynamically scoped to the current BLOCK.

The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches. See "BUGS".

See "@-" for a replacement.

Even if this information weren't conveniently in the documentation, you could still time it yourself and find out.
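As a rough sketch of the "@-" replacement that perlvar points to (the variable names here are only illustrative): after a successful match, $-[0] and $+[0] hold the start and end offsets of the whole match, so substr can recover the same text as $& without ever mentioning $&.

```perl
use strict;
use warnings;

my $str = 'abc123def';
my $matched;

if ( $str =~ /\d+/ ) {
    # $-[0] is the offset where the whole match starts,
    # $+[0] is the offset just past where it ends.
    $matched = substr( $str, $-[0], $+[0] - $-[0] );
}

print "$matched\n";    # prints "123"
```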

Chris Lutz
A: 
use Benchmark;

and test.

Generally, it really, really doesn't matter, unless you're doing billions of these operations.

depesz
innaM
... which would make the use of the usual timethese function very problematic, btw.
innaM
+1  A: 

Here is an easy way to get some idea of the performance impact of using $&. First off, you need to create two benchmark scripts. They will have most of the code in common:

#!/usr/bin/perl

use strict;
use warnings;
use autodie;
use File::Spec::Functions qw( devnull );

open my $output, '>', devnull;

my $str = <<EO_LIPSUM;
Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit
esse cillum dolore eu fugiat nulla pariatur. Excepteur sint
occaecat cupidatat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.
EO_LIPSUM

use Benchmark qw( timethese );

For the first benchmark add

### benchmark with $MATCH

timethese -1, {
    match_var => sub {
        $str =~ /commodo/;
        print $output $&;
        $str =~ /^Lorem|ipsum/ and print $output 'yes';
    }
}

and for the second benchmark file, use

timethese -1, {
    capture => sub {
        $str =~ /(commodo)/;
        print $output $1;
        $str =~ /^Lorem|ipsum/ and print $output 'yes';
    }
}

Now, let's run these benchmarks (they must be in separate files):

Benchmark: running capture for at least 1 CPU seconds...
   capture:  1 wallclock secs ( 1.05 usr +  0.00 sys =  1.05 CPU) @ 301485.20/s
(n=315655)
Benchmark: running match_var for at least 1 CPU seconds...
 match_var:  1 wallclock secs ( 1.22 usr +  0.02 sys =  1.23 CPU) @ 255591.09/s
(n=315655)

That is, using $& caused a slowdown of about 15% in this case. The slowdown is due to the impact of $& on the simple regular expression match. Without the

$str =~ /^Lorem|ipsum/ and print $output 'yes';

line, the version with $& actually performs faster.

Sinan Ünür
+10  A: 

Executive summary: use 5.010's /p instead. The performance of $& is about the same for a single match or substitution, but the entire program can suffer from it. Its slowdown is long-range, not local.


Here's a benchmark with 5.010, which I suspect you are using since you have say in there. Note that 5.010 has a new /p flag that supplies a ${^MATCH} variable that acts like $& but for only one instance of the match or substitution operator.
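For readers who haven't met /p yet, here is a minimal sketch of what it provides (the variable names are just for illustration). With /p on a match, ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} stand in for $`, $&, and $' for that one operator:

```perl
use 5.010;
use strict;
use warnings;

my $str = 'abc123def';
my ( $pre, $match, $post );

# The /p modifier asks this one match to preserve its matched text.
if ( $str =~ /\d+/p ) {
    ( $pre, $match, $post ) = ( ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} );
}

say "before: $pre, matched: $match, after: $post";
# prints "before: abc, matched: 123, after: def"
```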

As with any benchmark, I compare with a control to set the baseline so I know how much time the boring bits take up. Also, this benchmark has a trap: you can't use $& in the code or every substitution suffers. First run the benchmark without the $& sub:

use 5.010;

use Benchmark qw(cmpthese);

# Note: the lexical "my $_" used below works on perl 5.10 through 5.22.
# It was removed in 5.24, so use "local $_" instead on newer perls.
cmpthese(1_000_000, {
   'control' => sub { my $_ = 'abc123def'; s/\d+/246/ },
   'control-e' => sub { my $_ = 'abc123def'; s/\d+/123*2/e;  },
   '/p'      => sub { my $_ = 'abc123def'; s/\d+/${^MATCH}*2/pe },
   # '$&'      => sub { my $_ = 'abc123def'; s/\d+/$&*2/e },
   '()'      => sub { my $_ = 'abc123def'; s/(\d+)/$1*2/e },
});

On my MacBook Air running Leopard and a vanilla Perl 5.10:

     Rate        /p        () control-e   control
/p         70621/s        --       -1%      -58%      -78%
()         71124/s        1%        --      -58%      -78%
control-e 168350/s      138%      137%        --      -48%
control   322581/s      357%      354%       92%        --

Notice the big slowdown with the /e option, which I've added just for giggles.

Now, I'll uncomment the $& branch, and I see that everything is slower, although /p seems to shine here:

     Rate        ()        $&        /p control-e   control
()         68353/s        --       -4%       -7%      -58%      -74%
$&         70872/s        4%        --       -3%      -56%      -73%
/p         73421/s        7%        4%        --      -54%      -72%
control-e 161290/s      136%      128%      120%        --      -39%
control   262467/s      284%      270%      257%       63%        --

This is an odd benchmark. If I don't include the control-e sub, the situation looks different, which demonstrates another concept of benchmarking: it's not absolute and everything that you do matters in the final results. In this run, $& looks slightly faster:

   Rate      ()      /p      $& control
()       69686/s      --     -3%     -3%    -72%
/p       72098/s      3%      --     -0%    -71%
$&       72150/s      4%      0%      --    -71%
control 251256/s    261%    248%    248%      --

So, I ran it with control-e again, and the results move around a little:

     Rate        ()        /p        $& control-e   control
()         68306/s        --       -3%       -4%      -55%      -74%
/p         70175/s        3%        --       -1%      -54%      -73%
$&         71023/s        4%        1%        --      -53%      -73%
control-e 151976/s      122%      117%      114%        --      -41%
control   258398/s      278%      268%      264%       70%        --

The speed differences in each aren't impressive either. Anything under about 7% isn't that significant, since that difference comes from the accumulation of errors through the repeated calls to the sub (try it sometime by benchmarking the same code against itself). The slight differences you see come merely from the benchmarking infrastructure. With these numbers, each technique is virtually the same speedwise. You can't just run your benchmark once. You have to run it several times to see if you get repeatable results.
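One way to try the "same code against itself" experiment is sketched below. The labels first and second and the $work sub are made up for this illustration. Both entries run the identical code reference, so whatever percentage difference shows up in the table is noise from the benchmarking itself:

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Both entries run exactly the same code, so any difference in the
# reported rates is measurement noise, not a real speed difference.
my $work = sub { my $s = 'abc123def'; $s =~ s/(\d+)/$1*2/e; $s };

# -1 means "run each entry for at least 1 CPU second".
# cmpthese prints the comparison chart and also returns its rows.
my $rows = cmpthese( -1, {
    first  => $work,
    second => $work,
} );
```

Run it a few times; the two rates will usually differ by a few percent, and which one "wins" can change from run to run.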

Note that although the /p looks very slightly slower, it's also slower because $& cheats by messing up everyone. Notice the slowdown in the control too. This is one of the reasons that benchmarking is so dangerous. You can easily mislead yourself with the results if you don't think hard about why they are wrong (see the full screed in Mastering Perl, where I devote an entire chapter to this).

This simple and naïve benchmark excludes the killer disfeature of $&, though. Let's modify the benchmark to handle an additional match. First, the baseline with no $& effects, where I've constructed a situation where $& would have to copy about 1,000 characters in an additional match operator:

use 5.010;

use Benchmark qw(cmpthese);

$main::long = ( 'a' x 1_000 ) . '123' . ( 'b' x 1_000 );

cmpthese(1_000_000, {
   'control' => sub { my $_ = 'abc123def'; s/\d+/246/; $main::long =~ m/^a+123/; },
   'control-e' => sub { my $_ = 'abc123def'; s/\d+/123*2/e; $main::long =~ m/^a+123/; },
   '/p'      => sub { my $_ = 'abc123def'; s/\d+/${^MATCH}*2/pe; $main::long =~ m/^a+123/; },
   #'$&'      => sub { my $_ = 'abc123def'; s/\d+/$&*2/e; $main::long =~ m/^a+123/;},
   '()'      => sub { my $_ = 'abc123def'; s/(\d+)/$1*2/e; $main::long =~ m/^a+123/; },
});

Everything is much slower than before, but that's what happens when you do more work, and again the two techniques are within each other's noise:

     Rate        ()        /p control-e   control
()         52826/s        --       -4%      -49%      -63%
/p         54885/s        4%        --      -47%      -61%
control-e 103734/s       96%       89%        --      -27%
control   141243/s      167%      157%       36%        --

Now, I uncomment the $& sub:

     Rate        ()        $&        /p control-e   control
()         50607/s        --       -1%       -3%      -43%      -59%
$&         50968/s        1%        --       -2%      -43%      -58%
/p         52274/s        3%        3%        --      -41%      -57%
control-e  89206/s       76%       75%       71%        --      -27%
control   122100/s      141%      140%      134%       37%        --

That result is very interesting. Now /p, still penalized by the cheating $&, is slightly faster (but still within the noise), even though everyone suffers significantly.

Again, be very careful with these results. This does not mean that for every script, $& will have the same effect. You might see less of a slowdown, or more of it, depending on the number of matches, the particular regexes, and so on. What this, or any, benchmark shows is an idea, not a decision. You still have to figure out how this idea affects your particular situation.

brian d foy
+1 for Nice bench !
sebthebert