How can I remove capturing from arbitrarily nested sub-groups in a Perl regex string? I'd like to nest any regex into an enveloping expression that captures the sub-regex as a whole entity as well as statically known subsequent groups. Do I need to transform the regex string manually into using all non-capturing (?:) groups (and hope I don't mess up), or is there a Perl regex or library mechanism that provides this?

# How do I 'flatten' $regex to protect $2 and $3?
# Searching 'ABCfooDE' for 'foo' OK, but '((B|(C))fo(o)?(?:D|d)?)', etc., breaks.
# I.E., how would I turn it effectively into '(?:(?:B|(?:C))fo(?:o)?(?:D|d)?)'?
sub check {
  my($line, $regex) = @_;
  if ($line =~ /(^.*)($regex)(.*$)/) {
    print "<", $1, "><", $2, "><", $3, ">\n";
  }
}

Addendum: I am vaguely aware of $&, $`, and $' and have been advised to avoid them if possible, and I don't have access to ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} in my Perl 5.8 environment. The example above can be partitioned into two or three chunks using methods like these, and more complex real cases could iterate this manually, but I'd like a general solution if possible.

Accepted Answer: What I wish existed, and surprisingly (to me at least) does not, is an encapsulating group that makes its contents opaque, such that subsequent positional backreferences see the contents as a single entity and named references are de-scoped. gbacon has a potentially useful workaround for Perl 5.10+, and FM shows a manual iterative mechanism for any version that can accomplish the same effect in specific cases, but j_random_hacker is right that there is no real language mechanism to encapsulate subexpressions.

A: 

This doesn't disable capturing, but might accomplish what you want:

$ perl -wle '$_ = "123abc"; /(\d+)/ && print "num: $1"; { /([a-z]+)/ && print "letter: $1"; } print "num: $1";'
num: 123
letter: abc
num: 123

The inner match runs in its own block. Capture variables such as $1 are dynamically scoped to the enclosing block, so the outer $1 is restored when the block exits.
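
The same behaviour can be sketched inside a subroutine. The helper name scoped_demo is hypothetical and only illustrates the block scoping of $1; it does not stop the inner pattern from capturing.

```perl
use strict;
use warnings;

# Hypothetical helper: record what $1 holds before, inside, and after
# a bare block that runs a second match.
sub scoped_demo {
    my ($text) = @_;
    my @seen;
    if ($text =~ /(\d+)/) {
        push @seen, "outer before: $1";
        {
            # Captures set here are discarded when this block exits.
            if ($text =~ /([a-z]+)/) {
                push @seen, "inner: $1";
            }
        }
        push @seen, "outer after: $1";
    }
    return @seen;
}

print "$_\n" for scoped_demo("123abc");
# outer before: 123
# inner: abc
# outer after: 123
```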

nicomen
+7  A: 

One way to protect the subpatterns you care about is to use named capture buffers:

Additionally, as of Perl 5.10.0 you may use named capture buffers and named backreferences. The notation is (?<name>...) to declare and \k<name> to reference. You may also use apostrophes instead of angle brackets to delimit the name; and you may use the bracketed \g{name} backreference syntax. It's possible to refer to a named capture buffer by absolute and relative number as well. Outside the pattern, a named capture buffer is available via the %+ hash. When different buffers within the same pattern have the same name, $+{name} and \k<name> refer to the leftmost defined group.

In the context of your question, check becomes

sub check {
  use 5.10.0;  
  my($line, $regex) = @_;
  if ($line =~ /(^.*)($regex)(.*$)/) {
    print "<", $+{one}, "><", $+{two}, "><", $+{three}, ">\n";
  }
}

Then calling it with

my $pat = qr/(?<one>(?<two>B|(?<three>C))fo(o)?(?:D|d)?)/;   
check "ABCfooDE", $pat;

outputs

<CfooD><C><C>
Greg Bacon
This is a neat technique that I wasn't aware of, but unfortunately, I'm stuck in a RHEL 4 (Perl v5.8.5) environment, so I can't use it for the time being.
Jeff
+5  A: 

This does not address the general case, but your specific example can be handled with the /g option in scalar context, which would allow you to divide the problem into two matches, the second picking up where the first left off:

sub check {
    my($line, $regex) = @_;
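    my ($left_side, $regex_match, $right_side);
    # /g in scalar context remembers where the match ended (see pos).
    ($left_side, $regex_match) = ($1, $2) if $line =~ /(^.*)($regex)/g;
    # The next /g match on the same string resumes from that position.
    $right_side = $1 if $line =~ /(.*$)/g;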
    print "<$left_side> <$regex_match> <$right_side>\n"; # <AB> <CfooD> <E123>
}

check( 'ABCfooDE123', qr/((B|(C))fo(o)?(?:D|d)?)/ );
FM
Thanks, this technique is probably good enough for me to use for my actual use cases for now. I think I will eventually need a more general solution, so I'm going to keep the question open, though.
Jeff
+7  A: 

In general, you can't.

Even if you could transform all (...)s into (?:...)s, this would not work in the general case because the pattern might require backreferences: e.g. /(.)X\1/, which matches any character, followed by an X, followed by the originally matched character.

So, absent a Perl mechanism for discarding captured results "after the fact", there is no way to solve your problem for all regexes. The best you can do (or could do if you had Perl 5.10) is to use gbacon's suggestion and hope to generate a unique name for the capture buffer.
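
A related hazard can be sketched as follows. When a pattern containing a numbered backreference is interpolated into a larger pattern, its groups are renumbered along with everything else, so \1 ends up pointing at a group of the enclosing pattern. The sub names standalone and embedded are made up for this illustration.

```perl
use strict;
use warnings;

my $inner = qr/(.)X\1/;   # any character, X, then the same character

# On its own, \1 refers to the pattern's own first group.
sub standalone { return $_[0] =~ /^$inner$/ ? 1 : 0 }

# Embedded after another group, \1 now refers to (z), not to (.).
sub embedded { return $_[0] =~ /^(z)$inner$/ ? 1 : 0 }

print "standalone aXa:  ", standalone("aXa"),  "\n";   # 1
print "embedded   zaXa: ", embedded("zaXa"),   "\n";   # 0
print "embedded   zaXz: ", embedded("zaXz"),   "\n";   # 1
```

The enveloping pattern from the question, /(^.*)($regex)(.*$)/, shifts every group in $regex by two in the same way, so any \1 inside $regex would silently refer to the leading (^.*) group.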

j_random_hacker
+1  A: 

If all you need is the portion of the string before and after the match, you can use the @- and @+ arrays to get the offsets into the matched string:

sub check {
    my ($line, $regex) = @_;
    if ($line =~ /$regex/) {
        my $pre   = substr $line, 0, $-[0];
        my $match = substr $line, $-[0], $+[0] - $-[0];
        my $post  = substr $line, $+[0];
        print "<$pre><$match><$post>\n";
    }
}
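
A similar way to avoid depending on group numbers, sketched here as an assumption rather than something proposed in the answers, is to keep the enveloping pattern from the question but read the captures in list context. The leading groups sit at the front of the list and the statically known trailing groups at the end, so they can be indexed from either side no matter how many groups $regex contains. The helper name split_around is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (pre, match, post), or an empty list
# when $regex does not match anywhere in $line.
sub split_around {
    my ($line, $regex) = @_;
    # A list-context match returns every capture group in order.
    my @caps = $line =~ /(^.*)($regex)(.*$)/ or return;
    # First two groups are ours; the last group is the trailing (.*$).
    return ($caps[0], $caps[1], $caps[-1]);
}

print join("|", split_around('ABCfooDE', 'foo')), "\n";
# ABC|foo|DE
print join("|", split_around('ABCfooDE', '((B|(C))fo(o)?(?:D|d)?)')), "\n";
# AB|CfooD|E
```

This runs on Perl 5.8, but it only sidesteps the numbering of the outer groups. Numbered backreferences inside $regex are still renumbered by the wrapper and will misbehave.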
Sean