I'm parsing some big log files and have some very simple string matches, for example:

if(m/Some String Pattern/o){
    #Do something
}

It seems simple enough, but most of my patterns could instead be anchored at the start of the line, at the cost of a longer pattern. For example:

if(m/^Initial static string that matches Some String Pattern/o){
    #Do something
}

Obviously this is a longer regular expression, and so more work to match. However, the start-of-line anchor would allow a failed match to be discarded sooner.

My hunch is that the latter would be more efficient. Can anyone back me up or shoot me down? :-)

+4  A: 

I think you'll find that starting your regex with ^ will definitely be faster, because the regex engine doesn't have to look any further than the left edge of the string for a match.

This is something that you could easily test and measure, of course. Do a regex match 10 million times or so, measure how long it takes, then try again with a different regex.
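
A measurement along these lines might look like the sketch below. The two log lines and both patterns are made up for illustration (borrowed from the question's placeholders), and it uses cmpthese from the core Benchmark module. Treat whatever numbers it prints as specific to your machine and your data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data: one line that matches and one that does not.
my @lines = (
    "Initial static string that matches Some String Pattern, then more text",
    "An unrelated log line that neither pattern should match at all",
);

# The two candidate patterns from the question, precompiled with qr//.
my $short    = qr/Some String Pattern/;
my $anchored = qr/^Initial static string that matches Some String Pattern/;

# A negative count asks Benchmark to run each sub for at least
# that many CPU seconds, then print a comparison table.
cmpthese(-1, {
    short    => sub { my $hits = grep { $_ =~ $short } @lines },
    anchored => sub { my $hits = grep { $_ =~ $anchored } @lines },
});
```

Real log files usually contain far more non-matching lines than matching ones, so it is worth feeding the benchmark a sample of your own data rather than two hand-written lines.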

Greg Hewgill
Benchmarking proves otherwise. Please see my test below...
drewk
A: 

I vote for the one anchored at the beginning for exactly the reason you state!

scooterXL
+4  A: 

The line anchor makes it faster. I should add, though, that the //o modifier is not necessary here; in fact it does nothing, because the pattern contains no interpolated variables. That's a code smell to me.

There used to be valid uses for //o, but these days qr// covers them.
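
As a rough sketch of the qr// approach: the pattern text in $wanted and the sample lines are hypothetical, standing in for something only known at run time. The pattern is compiled once into a regex object and reused for every line.

```perl
use strict;
use warnings;

# Hypothetical pattern text, e.g. read from a config file at run time.
my $wanted = 'Some String Pattern';

# qr// compiles the pattern once into a reusable regex object.
# \Q...\E quotes any regex metacharacters in the interpolated string.
my $re = qr/^Initial static string that matches \Q$wanted\E/;

# Made-up sample lines for illustration.
my @lines = (
    "Initial static string that matches Some String Pattern",
    "Some other line entirely",
);

for my $line (@lines) {
    print "matched: $line\n" if $line =~ $re;
}
```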

Leon Timmermans
+3  A: 

Speed of an RE depends on two factors, the RE itself and the data being passed through the RE. In general, an anchored RE (start or end) with no backtracking will be faster than others. But if you're processing a file where every line is empty, there's no speed difference between /^hello/ and /hello/ (at least if the RE engine is written correctly).

But the rule I follow is: measure, don't guess.

paxdiablo
+3  A: 

I did some timings as recommended. Here are the results for my app. The timings are for the whole app, not just the regex searches. It scans 60,000 lines with 11 regular expressions. The short patterns average about 30 characters; the longer anchored ones are about 120.

Short
   real    0m58.780s
   user    0m54.940s
   sys     0m0.790s

Long (anchored)
   real    0m54.260s
   user    0m53.630s
   sys     0m0.490s

Long (not anchored)
   real    0m54.705s
   user    0m54.130s
   sys     0m0.400s

So anchoring the long strings is slightly faster, although not by much. It would appear that if my strings were any larger it might be a different matter.

Vagnerr
The difference between the times of Long(anchored) and Long(not anchored) is small enough to fall within the noise range of benchmarking -- unless you're running in single-user mode with no other processes running. The responsible conclusion is that it makes no difference.
Schwern
A: 

Are you saying you can anchor the regex by adding a static prefix, like this?

/^blah blah The Real Regex/

That certainly won't hurt performance, and it will probably help, but not for the reason you think. Although they're best known for the "magical" stuff like anchors and lookarounds and capturing groups, what regex engines are best at is matching literal sequences of characters. The longer the sequence, the faster the match (up to a point, of course).

In other words, it's the addition of the static prefix, not the anchor, that's giving you the boost.

Alan Moore
+1  A: 

You can gain tremendous insight into what the regex engine is doing in Perl with the use re 'debug' pragma. It is documented in perldoc re.
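
A minimal sketch of turning the pragma on for a single match follows. The sample line and pattern are made up for illustration. The pragma is lexically scoped in modern Perls, so wrapping it in a block should keep the trace limited to that one regex.

```perl
use strict;
use warnings;

# Made-up sample line, borrowed from the question's placeholder text.
my $line = "Initial static string that matches Some String Pattern";

my $matched;
{
    # Enable regex debugging for this block only; the trace goes to STDERR.
    use re 'debug';
    $matched = ($line =~ /^Initial static string/) ? 1 : 0;
}

print "matched: $matched\n";
```

Redirecting STDERR to a file (for example with 2>trace.txt in a shell) keeps the compile and match trace separate from the script's normal output.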

It is always helpful to review the performance techniques that Perl's own documentation suggests (perlperf), including its suggested timing methods.

If I run this small test:

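#!/usr/bin/perl

use strict;
use warnings;
use Benchmark;

my $target = "aeiou";

my $str = "lkdjflzdjfljdsflkjasdjf asldkfj lasdjf dslfj sldfj asld alskdfj lasd f";

# The target sits at the very end of the string being searched.
my $str2 = $str . $target;

timethese(10_000_000, {
    # Unanchored: the target may float anywhere in the string.
    'float' => sub {
        die "no match" unless $str2 =~ m/$target/o;
    },
    # Anchored, with a wildcard in front of the target.
    'anchored' => sub {
        die "no match" unless $str2 =~ m/^.*$target/o;
    },
    # Anchored, with the literal prefix in front of the target.
    'prefixed' => sub {
        die "no match" unless $str2 =~ m/^$str$target/o;
    },
});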

I get the output of:

Benchmark: timing 10000000 iterations of anchored, float, prefixed...
  anchored:  4 wallclock secs ( 3.46 usr +  0.01 sys =  3.47 CPU) @ 2881844.38/s 
     float:  2 wallclock secs ( 1.87 usr +  0.00 sys =  1.87 CPU) @ 5347593.58/s 
  prefixed:  4 wallclock secs ( 3.05 usr +  0.01 sys =  3.06 CPU) @ 3267973.86/s 

Which leads to the conclusion that the non-anchored (floating) version is way faster. However, the regex and the source data may change that. YMMV, so test, test, test...

drewk
Your benchmark shows that `/$target/` is faster than `/^.*$target/`, which I would intuitively expect (but it's always a good idea to measure). The OP's case of an anchored literal string without wildcards is likely to show different performance characteristics than what you've measured.
Greg Hewgill
But look at 'prefixed' which is precisely what the OP was asking about: 1) anchored string, 2) prefixed with a string literal, 3) target well into some string literal, 4) no wildcards. So the anchored 'prefixed' version is as slow as the wildcard version. Floating string is faster than anchored prefixed string (the OP) and anchored wildcard.
drewk