tags:

views:

201

answers:

5

I want to capture several text using the following regex:

$text_normal = qr{^(\/F\d+) FF (.*?) SCF SF (.*?) MV (\(.*?)SH$};

A sample of the string is like below:

my $text = '/F12345 FF FF this is SCF SF really MV (important stuff SH';

Can that be rewritten to speed up the matching?

+6  A: 

This very much depends on the profile of the data you are scanning.

What you can do is identify the piece of your RE which filters out the most input and do a separate simpler RE for that expression.

For instance if only 5% of your input date contained the 'MV' string, you could filter for this first and only apply the full more complex RE if the simpler one is true

So you would have:

if ( $text_normal =~ / MV / ) {
  $text_normal = qr{^(\/F\d+) FF (.*?) SCF SF (.*?) MV (\(.*?)SH$};
  if .......
  }
}
James Anderson
If you're just checking for a static string in your first pass to weed things out, consider benchmarking whether `index(' MV ')` is faster than doing a regex match for it.
Dave Sherohman
+6  A: 

Without seeing some sample data it is hard to say.

Generally, it is a good idea to avoid using .*. Look for any possible sources of unneeded backtracking, and eliminate them.

You might be able to get away with a split with a slice if your needs are simple.

 my @vals = (split / /, $string)[0,2,5,7];
daotoad
+10  A: 

There's no single answer to optimizing a regex. You can watch what a particular regex is doing with the re pragma:

 use re 'debugcolor';

Once you see what it traverses the string, you see where it is having problems and adjust your regex from there. You'll learn a little about the regex engine as you do that.

You should also check out Mastering Regular Expressions, which tells you how regular expressions work and why some patterns are slower than others.

brian d foy
+1 You learn something new everyday!
draegtun
A: 

Backtracking is one of the surest ways to kill regex performance, but, unfortunately, this does not appear to be a case where you're able to completely eliminate the . wildcard in favor of character classes, unless the text you're capturing is prohibited from containing uppercase characters. (If that prohibition does exist, you could replace your .*? with, say, [a-z ]* instead.) You can still reduce the possibility for backtracking by using {} to set a minimum/maximum number of characters to match, such as .{0,10}? if the match can't be longer than 10 characters.

Dave Sherohman
+2  A: 

(.*) means that you are dealing with any number of repetitions of " SCF SF " before you find the one that indicates it's the next capture, by making it non-greedy you're still handling the capability that even 'SCF SF' would appear in the capture after 'FF'. I think you are handling a lot of cases you don't need.

The best way to optimize a regular expression sometimes makes it more cryptic--but you definitely find ways to make the expression fail earlier. (.*?) while not being "greedy" is definitely too tolerant.

Below is a more verbose, but faster failing alternative to your second capture.

((?:[^S]|S[^C]|SC[^F]|SCF[^ ]|SCF [^S]|SCF S[^F])*)

But you can optimize it even more if you think that the string \bSCF\b should automatically make the capture commit and expect only "\bSCF SF\b". Thus you could re-write that as:

((?:[^S]|S[^C]|SC[^F]SCF\B)*) SCF SF

But you can optimize these strings even more by backtracking control. If you think that there is no way in the world that SCF would ever occur as a word and not be followed by SF on valid input. To do that, you add another group around it, with brackets (?> and ).

(?>((?:[^S]|S[^C]|SC[^F]SCF\B)*)) SCF SF

That means that the matching logic will in no way try to reassess what it captured. If the characters after this fail to be " SCF SF " the whole expression fails. And it fails long before it ever gets to try to accommodate "MV" and other sub-expressions.

In fact, given certain expressions about the uniqueness of the delimiters, the fastest performance for this expression would be:

$text_normal = qr{^(\/F\d+) FF (?>((?:[^S]|S[^C]|SC[^F]SCF\B)*))SCF SF (?>((?:[^M]|M[^V]|MV\B)*))MV (?>(\((?:[^S]|S[^H]|SH.)*))SH$};

Additionally, the verbose, exhaustive negative matches can be alternative expressed by negative lookaheads--but I have no idea on how that works on performance. But negative look aheads would work like this:

((?:.(?! SCF))*) SCF SF

This means that for this capture I want any character that is not a space starting the string " SCF SF ".

Axeman