tags:

views:

377

answers:

4

The Perl FAQ entry How do I strip blank space from the beginning/end of a string? states that using

s/^\s+|\s+$//g;

is slower than doing it in two steps:

s/^\s+//;
s/\s+$//;

Why is this combined statement noticeably slower than the separate ones (for any input string)?

+1  A: 

If this is indeed the case, then it would be because the regex engine is able to optimize better for the individual regexes than for the combined one.

What do you mean by "noticeably slower"?

Anon.
@Anon: about 5 times slower on a 25 chars string
eugene y
@Anon. See the FAQ http://perldoc.perl.org/perlfaq4.html#How-do-I-strip-blank-space-from-the-beginning/end-of-a-string%3f
Sinan Ünür
Noticeably slower means you can notice the difference, and in this case you can see a huge speed penalty.
brian d foy
+3  A: 

Since the two methods are logically equivalent, there's no inherent reason for them to differ in evaluation performance. In practice, however, some engines won't be able to spot optimizations in more complex regexes.

In this case, the combined regex as a whole is unanchored, so it could potentially match at any point in the string, while the ^\s+ is anchored at the start, so it is trivial to match, and \s+$ is anchored at the end, and provides a single character class for each character from the end backwards - a well-optimized engine will recognize that fact and will match in reverse, which makes it as trivial as a ^\s+ match on the reverse of the input.

Max Shawabkeh
+11  A: 

The Perl regex runtime runs much quicker when working with 'fixed' or 'anchored' substrings rather than 'floated' substrings. A substring is fixed when you can lock it to a certain place in the source string. Both '^' and '$' provide that anchoring. However, when you use alternation '|', the compiler doesn't recognize the choices as fixed, so it uses less optimized code to scan the whole string. And at the end of the process, looking for fixed strings twice is much, much faster than looking for a floating string once. On a related note, reading perl's regcomp.c will make you go blind.

Update: Here's some additional details. You can run perl with the '-Dr' flag if you've compiled it with debugging support and it'll dump out regex compilation data. Here's what you get:

~# debugperl -Dr -e 's/^\s+//g'
Compiling REx `^\s+'
size 4 Got 36 bytes for offset annotations.
first at 2
synthetic stclass "ANYOF[\11\12\14\15 {unicode_all}]".
   1: BOL(2)
   2: PLUS(4)
   3:   SPACE(0)
   4: END(0)
stclass "ANYOF[\11\12\14\15 {unicode_all}]" anchored(BOL) minlen 1

# debugperl -Dr -e 's/^\s+|\s+$//g'
Compiling REx `^\s+|\s+$'
size 9 Got 76 bytes for offset annotations.

   1: BRANCH(5)
   2:   BOL(3)
   3:   PLUS(9)
   4:     SPACE(0)
   5: BRANCH(9)
   6:   PLUS(8)
   7:     SPACE(0)
   8:   EOL(9)
   9: END(0)
minlen 1 

Note the word 'anchored' in the first dump.

sidereal
You don't actually need a debugging build of perl. You need it for `-Dr` to work, but without it you can `use re 'debug'` which is prettier anyway.
hobbs
+8  A: 

Other answers have indicated that the fully anchored regexes allow the engine to optimize the search process, focusing on just the beginning or the end or the string. It appears that you can see the effect of this optimization by comparing the speed difference of the two approaches using strings of various lengths. As the string gets longer, the "floating" regex (using alternation) suffers more and more.

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $ws = "   \t\t\n";

for my $sz (1, 10, 100, 1000){
    my $str = $ws . ('Z' x $sz) . $ws;
    cmpthese(-2, {
        "alt_$sz" => sub { $_ = $str; s/^\s+|\s+$//g },
        "sep_$sz" => sub { $_ = $str; s/^\s+//; s/\s+$// },
    });
}

           Rate alt_1 sep_1
alt_1  870578/s    --  -16%
sep_1 1032017/s   19%    --

            Rate alt_10 sep_10
alt_10  384391/s     --   -62%
sep_10 1010017/s   163%     --

            Rate alt_100 sep_100
alt_100  61179/s      --    -92%
sep_100 806840/s   1219%      --

             Rate alt_1000 sep_1000
alt_1000   6612/s       --     -97%
sep_1000 261102/s    3849%       --
FM