
I have a growing list of regular expressions that I am using to parse through log files searching for "interesting" error and debug statements. I'm currently breaking them into 5 buckets, with most of them falling into 3 large buckets. I have over 140 patterns so far, and the list is continuing to grow.

Most of the regular expressions are simple, but they're also fairly unique, so my opportunities to catch multiple matches with a single pattern are few and far between. Because of the nature of what I'm matching, the patterns tend to be obscure and therefore seldom matched, so I'm doing a TON of work on each input line only for it to match nothing at all, or to match one of the generic patterns at the very end.

And because of the quantity of input (hundreds of megabytes of log files) I'm sometimes waiting for a minute or two for the script to finish. Hence my desire for a more efficient solution. I'm not interested in sacrificing clarity for speed, though.

I currently have the regular expressions set up like this:

 if (($line =~ m{Failed in routing out}) ||
     ($line =~ m{Agent .+ failed}) ||
     ($line =~ m{Record Not Exist in DB}) ||
     ...

Is there a better way of structuring this so it's more efficient, yet still maintainable? Thanks!

+5  A: 

You can combine your regexes with the alternation operator |, as in: /pattern1|pattern2|pattern3/

Obviously, it won't be very maintainable if you put all of them in a single line, but you've got options to mitigate that.

  • You can use the /x regex modifier to space them nicely, one per line (see the sketch after the code below). A word of caution if you choose this direction: you'll have to explicitly specify the space characters you expect, otherwise they'd be ignored because of the /x.
  • You can generate your regular expression at run-time, by combining individual sources. Something like this (untested):

    use feature 'say';               # say() needs the 5.10 feature enabled

    my $regex = join '|', @sources;  # @sources holds the individual pattern strings
    while (<>) {
        next unless /$regex/o;       # /o: compile the combined pattern only once
        say;
    }
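
For the first option, a minimal sketch of how the /x layout could look (untested; literal spaces are written as "\ " so the /x modifier doesn't discard them):

    my $regex = qr{
          Failed\ in\ routing\ out
        | Agent\ .+\ failed
        | Record\ Not\ Exist\ in\ DB
    }x;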
    
JB
+1 This can actually be more efficient than doing them separately. The regexp engine can be intelligent and pull out common prefixes that can speed things up (this is true for perl 5.10 at least, and possibly before).
Chris Simmons
/o is obsolete/deprecated. Use qr//. And I think in modern perls, as long as $regex doesn't change, then it won't be recompiled anyway. But don't quote me :-)
runrig
@runrig I do recall reading something about /o being deprecated, but couldn't find anything about it in the docs. As far as I can remember, it's necessary in 5.8.8 (how modern is that? don't quote me either). qr// avoids the issue, but it's further away from the next logical step in my reasoning: reading the regex list from other sources (e.g., a separate file).
JB
A: 

One possible solution is to let the regex state machine do the checking of alternatives for you. You'll have to benchmark to see if the result is noticeably more efficient, but it will certainly be more maintainable.

First, you'd maintain a file containing one pattern of interest per line.

Failed in routing out
Agent .+ failed
Record Not Exist in DB

Then you'd read in that file at the beginning of your run, and construct a large regular expression using the alternation operator, "|".

open my $pattern_fh, '<', 'foo.txt' or die $!;
chomp( my @patterns = <$pattern_fh> );
close $pattern_fh or die $!;

# build one big alternation to try in a single pass per line
my $matcher = join '|', @patterns;

while (<MYLOG>) {
    print if /$matcher/;
}
Jonathan Feinberg
+1  A: 

Maybe something like:

my @interesting = (
  qr/Failed in routing out/,
  qr/Agent .+ failed/,
  qr/Record Not Exist in DB/,
);

...


for my $re (@interesting) {
  if ($line =~ /$re/) {
    print $line;
    last;
  }
}

You can try joining all your patterns with "|" to make one regex. That may or may not be faster.
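
A minimal, untested sketch of that variant; each qr// object stringifies to a self-contained non-capturing group, so joining them with "|" is safe:

my $any_interesting = join '|', @interesting;

print $line if $line =~ /$any_interesting/;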

runrig
+4  A: 

You might want to get rid of the large if statement:

my @interesting = (
  qr/Failed in routing out/,
  qr/Agent .+ failed/,
  qr/Record Not Exist in DB/,
);

# skip this line unless it matches at least one of the patterns
return unless grep { $line =~ $_ } @interesting;

although I cannot promise this will improve anything w/o benchmarking with real data.

It might help if you can anchor your patterns at the beginning so they can fail more quickly.

Sinan Ünür
Sinan - +1 but please remember that he's parsing log files, so the naked strings won't be anchorable, most likely. But hopefully his logs have uniform prefixes so he can anchor on the timestamp or whatever the prefix looks like.
DVK
Indeed, I'm usually very religious about anchoring my strings, but as DVK says, the strings can and will be all over the place (especially as I'm parsing different types of log files). The timestamp RE is the first one I do and is of course anchored (once I figure out which timestamp RE to use).
Joe Casadonte
Log lines usually have some defined fields. I was thinking along the lines of removing those fields first before looking at possible matches.
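
A rough sketch of that idea (the timestamp layout below is only an assumption; adjust it to the actual log format):

# strip a leading "YYYY-MM-DD HH:MM:SS" style prefix before pattern matching
(my $payload = $line) =~ s/^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\s*//;
# ...then run the "interesting" pattern checks against $payload instead of $line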
Sinan Ünür
+1  A: 

Your example regular expressions look like they are based mainly on ordinary words and phrases. If that's the case, you might be able to speed things up considerably by pre-filtering the input lines using index, which is much faster than a regular expression. Under such a strategy, every regular expression would have a corresponding non-regex word or phrase for use in the pre-filtering stage. Better still would be to skip the regular expression test entirely, wherever possible: two of your example tests do not require regular expressions and could be done purely with index.

Here is an illustration of the basic idea:

use strict;
use warnings;

my @checks = (
    ['Failed',    qr/Failed in routing out/  ],
    ['failed',    qr/Agent .+ failed/        ],
    ['Not Exist', qr/Record Not Exist in DB/ ],
);
my @filter_strings = map { $_->[0] } @checks;
my @regexes        = map { $_->[1] } @checks;

sub regex {
    my $line = shift;
    for my $reg (@regexes){
        return 1 if $line =~ /$reg/;
    }
    return;
}

sub pre {
    my $line = shift;
    for my $fs (@filter_strings){
        return 1 if index($line, $fs) > -1;
    }
    return;
}

my @data = (
    qw(foo bar baz biz buz fubb),
    'Failed in routing out.....',
    'Agent FOO failed miserably',
    'McFly!!! Record Not Exist in DB',
);

use Benchmark qw(cmpthese);
cmpthese ( -1, {
    regex => sub { for (@data){ return $_ if(            regex($_)) } },
    pre   => sub { for (@data){ return $_ if(pre($_) and regex($_)) } },
} );

Output (results with your data might be very different):

             Rate     regex prefilter
regex     36815/s        --      -54%
prefilter 79331/s      115%        --
FM
You should use '.+?' in the regex.
Alexandr Ciornii
+2  A: 

This is handled easily with Perl 5.10.

use strict;
use warnings;
use 5.10.1;

my @matches = (
  qr'Failed in routing out',
  qr'Agent .+ failed',
  qr'Record Not Exist in DB'
);

# ...

sub parse{
  my($filename) = @_;

  open my $file, '<', $filename or die "Can't open '$filename': $!";

  while( my $line = <$file> ){
    chomp $line;

    # you could use given/when
    given( $line ){
      when( @matches ){
        #...
      }
    }

    # or smartmatch
    if( $line ~~ @matches ){
      # ...
    }
  }
}

You could use the new Smart-Match operator ~~.

if( $line ~~ @matches ){ ... }

Or you can use given/when, which performs the same match as the Smart-Match operator.

given( $line ){
  when( @matches ){
    #...
  }
}
Brad Gilbert
+3  A: 

You might want to take a look at Regexp::Assemble. It's intended to handle exactly this sort of problem.

Boosted code from the module's synopsis:

use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add( 'ab+c' );
$ra->add( 'ab+-' );
$ra->add( 'a\w\d+' );
$ra->add( 'a\d+' );
print $ra->re; # prints a(?:\w?\d+|b+[-c])

You can even slurp your regex collection out of a separate file.
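
A minimal sketch of that (the file name and the one-pattern-per-line layout are assumptions):

use Regexp::Assemble;

my $ra = Regexp::Assemble->new;

open my $fh, '<', 'patterns.txt' or die $!;    # hypothetical pattern file
while ( my $pattern = <$fh> ) {
    chomp $pattern;
    $ra->add($pattern) if length $pattern;     # ignore blank lines
}
close $fh;

my $big_re = $ra->re;    # one assembled regex covering every pattern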

daotoad
Definitely the way to go. I have an app currently in production which uses Regexp::Assemble to compare incoming text strings against a list of 1,334 terms to see which (if any) of them are in each string. The code is simple as hell and runs nice and fast.
Dave Sherohman
+3  A: 

From perlfaq6's answer to How do I efficiently match many regular expressions at once?


How do I efficiently match many regular expressions at once?

( contributed by brian d foy )

Avoid asking Perl to compile a regular expression every time you want to match it. In this example, perl must recompile the regular expression for every iteration of the foreach loop since it has no way to know what $pattern will be.

@patterns = qw( foo bar baz );

LINE: while( <DATA> )
 {
 foreach $pattern ( @patterns )
  {
  if( /\b$pattern\b/i )
   {
   print;
   next LINE;
   }
  }
 }

The qr// operator showed up in perl 5.005. It compiles a regular expression, but doesn't apply it. When you use the pre-compiled version of the regex, perl does less work. In this example, I inserted a map to turn each pattern into its pre-compiled form. The rest of the script is the same, but faster.

@patterns = map { qr/\b$_\b/i } qw( foo bar baz );

LINE: while( <> )
 {
 foreach $pattern ( @patterns )
  {
  if( /$pattern/ )
   {
   print;
   next LINE;
   }
  }
 }

In some cases, you may be able to make several patterns into a single regular expression. Beware of situations that require backtracking though.

$regex = join '|', qw( foo bar baz );

LINE: while( <> )
 {
 print if /\b(?:$regex)\b/i;
 }

For more details on regular expression efficiency, see Mastering Regular Expressions by Jeffrey Friedl. He explains how the regular expression engine works and why some patterns are surprisingly inefficient. Once you understand how perl applies regular expressions, you can tune them for individual situations.

brian d foy