I use Regexp::Assemble in my project, but I don't understand why this little sample doesn't work:

#!/usr/bin/perl

use strict;
use warnings;

use Regexp::Assemble;

my $re1 = "(run (?:pre|post)flight script for .+)";
my $re2 = "((?:Configu|Prepa)ring volume .+)";

my $ra   = Regexp::Assemble->new;
$ra->add($re1);
$ra->add($re2);
my $global = $ra->re;

print "GLOBAL: $global\n";

1;

I got this error:

Unmatched ( in regex; marked by <-- HERE in m/( <-- HERE ?:(run (?:pre|post)flight script for|((?:Configu|Prepa)ring volume) .+)/ at /usr/share/perl5/Regexp/Assemble.pm line 1003.

Edit: If I just print the resulting regexp ($ra->as_string), I get this:

GLOBAL: (?:(run (?:pre|post)flight script for|((?:Configu|Prepa)ring volume) .+)

There is one ')' missing...

+3  A: 

This looks like a bug? You are confusing the regex constructor. See how it combined your two patterns and mismatched the parentheses:

my $re1 =     "(run (?:pre|post)flight script for .+)";
my $re2 =                                        "((?:Configu|Prepa)ring volume .+)";

#         m/(?:(run (?:pre|post)flight script for|((?:Configu|Prepa)ring volume) .+)/ at...

Try removing the extra set of parentheses from your regexes and see if that helps:

my $re1 = "run (?:pre|post)flight script for .+";
my $re2 = "(?:Configu|Prepa)ring volume .+";
Ether
Yes, it works without the extra parentheses... but this is just one example; I need them in more complex regexps!
sebthebert
Well, this answers the question as you have written it... :) Maybe edit your question to give an example of something more complicated?
Ether
Ok, I should add "and how can I fix that without modifying my regexps" :) Any ideas?
sebthebert
As per martin clayton's quote, I suspect you will probably have to modify your regexps. Is it possible you can compose them differently or call add() in a different order?
Ether
+5  A: 

Ether's approach seems like a plan. The module documentation specifically warns about this case:

add() ... It uses a naive regular expression to lex the string that may be fooled [by] complex expressions (specifically, it will fail to lex nested parenthetical expressions such as ab(cd(ef)?gh)ij correctly). If this is the case, the end of the string will not be tokenised correctly and returned as one long string.
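
To make the limitation quoted above concrete, here is a minimal sketch of a pre-check that measures how deeply parentheses are nested in a pattern before it is handed to add(). The helper name max_paren_depth is hypothetical and not part of Regexp::Assemble; it skips backslash-escaped characters but does not special-case parentheses inside [...] classes, so treat it as a rough heuristic only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Regexp::Assemble): returns the deepest
# level of parenthesis nesting found in a pattern string.
# Backslash-escaped characters are skipped; parentheses inside [...]
# character classes are NOT special-cased.
sub max_paren_depth {
    my ($pattern) = @_;
    my ($depth, $max) = (0, 0);
    while ($pattern =~ /\G(?:\\.|([()])|[^\\()]+)/gcs) {
        next unless defined $1;    # escaped char or plain text
        if ($1 eq '(') {
            $depth++;
            $max = $depth if $depth > $max;
        }
        else {
            $depth--;
        }
    }
    return $max;
}

my @patterns = (
    '(run (?:pre|post)flight script for .+)',    # nested: depth 2
    '(?:Configu|Prepa)ring volume .+',           # one level: depth 1
    'run preflight script for .+',               # flat: depth 0
);

for my $pattern (@patterns) {
    my $depth = max_paren_depth($pattern);
    my $note  = $depth > 1 ? 'nested, may confuse the lexer' : 'ok';
    print "$depth  $pattern  ($note)\n";
}
```

In this thread, the patterns reported as failing have depth 2, while the ones reported as working have depth 1 or 0.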

martin clayton
I missed that on my first reading... :(
sebthebert
Ok thanks, you answered the 'WHY', but I'm also interested in the 'HOW CAN I FIX THAT' :)
sebthebert
+3  A: 

I'm the author of R::A. This question comes up every couple of years. The idea is that you don't want to add complex parenthesised patterns. Add more, simpler patterns, e.g.

run preflight script for .+
run postflight script for .+
Configuring volume .+
Preparing volume .+

Don't try to do the work of the module. For instance, your premature grouping has resulted in the trailing .+ common to all patterns not being factored into one occurrence in the regexp. The result is that you have introduced unnecessary backtracking. The more patterns you add, the worse it will be.
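
As a rough illustration of the advice above, the question's script could be reworked along these lines. This is a sketch, not the only way to do it: the sample input lines are made up, and the exact text of the assembled regexp may differ between module and Perl versions, so no output is shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Regexp::Assemble;

# One flat pattern per alternative: no capturing or nested groups.
my @patterns = (
    'run preflight script for .+',
    'run postflight script for .+',
    'Configuring volume .+',
    'Preparing volume .+',
);

my $ra = Regexp::Assemble->new;
$ra->add(@patterns);     # add() accepts a list of patterns
my $global = $ra->re;    # compiled regexp object

print "GLOBAL: $global\n";

# Hypothetical sample lines, invented for this illustration.
my @lines = (
    'run postflight script for Foo.pkg',
    'Preparing volume Macintosh HD',
    'Something unrelated',
);

for my $line (@lines) {
    # The outer capture is applied once, at match time, instead of
    # being written into every pattern passed to add().
    if ($line =~ /^($global)$/) {
        print "matched: $1\n";
    }
}
```

If the whole match is needed as a capture, wrapping the assembled regexp in a single outer group at match time, as the loop does, keeps the added patterns flat.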

Calling add() in a different order will produce the same resulting pattern (or else it's a bug I'd like to know about).

Otherwise you can pretokenise the patterns yourself, and use insert() to insert the pattern lexemes directly into the internal trie structure used to build the pattern. (This will be much faster, because the lexer is very slow: it consumes more than half the runtime for assembling a pattern).
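
A minimal sketch of the insert() route follows, assuming that each literal character is one token and that the trailing .+ is passed as a single token (an atom together with its quantifier). The helper tokens_for is hypothetical and only handles plain literal prefixes; anything containing metacharacters would need its own tokenising rules.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Regexp::Assemble;

# Hypothetical helper: split a plain literal prefix into one-character
# tokens and append '.+' as a single token. Assumes the prefix contains
# no regexp metacharacters.
sub tokens_for {
    my ($literal) = @_;
    return (split(//, $literal), '.+');
}

my $ra = Regexp::Assemble->new;

for my $prefix (
    'run preflight script for ',
    'run postflight script for ',
    'Configuring volume ',
    'Preparing volume ',
) {
    # insert() takes the token list directly, bypassing the lexer.
    $ra->insert(tokens_for($prefix));
}

my $re = $ra->re;
print "ASSEMBLED: $re\n";
```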

dland
When I assemble the above four patterns, I get `(?:run p(?:ost|re)flight script for|(?:Configu|Prepa)ring volume) .+` . Notice how the module hoisted the 'p' out of the (post|pre) alternation?
dland