views:

539

answers:

4

Is there a defined behavior for how regular expressions should handle the capturing behavior of nested parentheses? More specifically, can you reasonably expect that different engines will capture the outer parentheses in the first position, and nested parentheses in subsequent positions?

Consider the following PHP code (using PCRE regular expressions)

<?php
  $test_string = 'I want to test sub patterns';
  preg_match('{(I (want) (to) test) sub (patterns)}', $test_string, $matches);
  print_r($matches);
?>

Array
(
 [0] => I want to test sub patterns //entire pattern
 [1] => I want to test   //entire outer parenthesis
 [2] => want    //first inner
 [3] => to    //second inner
 [4] => patterns    //next parentheses set
)

The entire parenthesized expression is captured first (I want to test), and then the inner parenthesized patterns are captured next ("want" and "to"). This makes logical sense, but I could see an equally logical case being made for first capturing the sub parentheses, and THEN capturing the entire pattern.

So, is this "capture the entire thing first" defined behavior in regular expression engines, or is it going to depend on the context of the pattern and/or the behavior of the engine (PCRE being different than C#'s being different than Java's being different than etc.)?

+2  A: 

The order of capturing in the order of the left paren is standard across all the platforms I've worked in.

Devin Ceartas
"capturing in the order of the left paren" Thanks for that, it's a much more succinct way of describing the behavior.
Alan Storm
Which platforms have you worked on?
Brad Gilbert
You can re-number the captures in Perl 5.10, and Perl 6.
Brad Gilbert
+8  A: 

From perlrequick

If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc.

Update

I don't use PCRE much, as I generally use the real thing ;), but PCRE's docs show the same as Perl's:

SUBPATTERNS

2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain number for the capturing subpatterns.

For example, if the string "the red king" is matched against the pattern

the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.

If PCRE is drifting away from Perl regex compatibility, perhaps the acronym should be redefined--"Perl Cognate Regular Expressions", "Perl Comparable Regular Expressions" or something. Or just divest the letters of meaning.

daotoad
+1 but note that he is not using Perl.
Sinan Ünür
@Sinan : he is using PCRE in PHP, which is "Perl-Compatible Regular Expressions" ; so it should be quite the same as using Perl directly
Pascal MARTIN
Pascal, PCRE started as an attempt to be a Perl Compatible Regular Expression set, but in recent years the two have diverged slightly. Still very similar, but there are subtle differences in teh advanced feature sets. (Also, per the question, I'm interested in all platforms)
Alan Storm
He can pick any RE engine out there, it should all be the same.
sixlettervariables
Actually, it's Perl that's doing most of the "drifting away" these days, but you're right: "Perl-compatible" is quickly changing from a misnomer to a non sequitur. :D
Alan Moore
@Alan, Perl is definitely on the move. P5.10 changed a few things, but 6 will be vastly different. The P will almost certainly need to be interpreted as "Perl 5". PCRE is a great project, that I can't praise enough, it has been a godsend on more than a few projects.
daotoad
+2  A: 

Every regex flavor I know numbers groups by the order in which the opening parentheses appear. That outer groups are numbered before their contained sub-groups is just a natural outcome, not explicit policy.

Where it gets interesting is with named groups. In most cases, they follow the same policy of numbering by the relative positions of the parens--the name is merely an alias for the number. However, in .NET regexes the named groups are numbered separately from numbered groups. For example:

Regex.Replace(@"one two three four", 
              @"(?<one>\w+) (\w+) (?<three>\w+) (\w+)",
              @"$1 $2 $3 $4")

// result: "two four one three"

In effect, the number is an alias for the name; the numbers assigned to named groups start where the "real" numbered groups leave off. That may seem like a bizarre policy, but there's a good reason for it: in .NET regexes you can use the same group name more than once in a regex. That makes possible regexes like the one from this thread for matching floating-point numbers from different locales:

^[+-]?[0-9]{1,3}
(?:
    (?:(?<thousand>\,)[0-9]{3})*
    (?:(?<decimal>\.)[0-9]{2})?
|
    (?:(?<thousand>\.)[0-9]{3})*
    (?:(?<decimal>\,)[0-9]{2})?
|
    [0-9]*
    (?:(?<decimal>[\.\,])[0-9]{2})?
)$

If there's a thousands separator, it will be saved in group "thousand" no matter which part of the regex matched it. Similarly, the decimal separator (if there is one) will always be saved in group "decimal". Of course, there are ways to identify and extract the separators without reusable named groups, but this way is so much more convenient, I think it more than justifies the weird numbering scheme.

And then there's Perl 5.10+, which gives us more control over capturing groups than I know what to do with. :D

Alan Moore
A: 

Yeah, this is all pretty much well defined for all the languages you're interested in:

  • Java - http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#cg
    "Capturing groups are numbered by counting their opening parentheses from left to right. ... Group zero always stands for the entire expression."
  • .Net - http://msdn.microsoft.com/en-us/library/bs2twtah%28VS.71%29.aspx
    "Captures using () are numbered automatically based on the order of the opening parenthesis, starting from one. The first capture, capture element number zero, is the text matched by the whole regular expression pattern.")
  • PHP (PCRE functions) - http://www.php.net/manual/en/function.preg-replace.php#function.preg-replace.parameters
    "\0 or $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern." (It was also true of the deprecated POSIX functions)
  • PCRE - http://www.pcre.org/pcre.txt
    To add to what Alan M said, search for "How pcre_exec() returns captured substrings" and read the fifth paragraph that follows:

    The  first  pair  of  integers, ovector[0] and ovector[1], identify the
    portion of the subject string matched by the entire pattern.  The next
    pair  is  used for the first capturing subpattern, and so on. The value
    returned by pcre_exec() is one more than the highest numbered pair that
    has  been  set.  For example, if two substrings have been captured, the
    returned value is 3. If there are no capturing subpatterns, the  return
    value from a successful match is 1, indicating that just the first pair
    of offsets has been set.
    
  • Perl's different - http://perldoc.perl.org/perlre.html#Capture-buffers
    $1, $2 etc. match capturing groups as you'd expect (i.e. by occurrance of opening bracket), however $0 returns the program name, not the entire query string - to get that you use $& instead.

You'll more than likely find similar results for other languages (Python, Ruby and others).

You say that it's equally logical to list the inner capture groups first and you're right - it's just be a matter of indexing on closing, rather than opening, parens. (if I understand you correctly). Doing this is less natural though (for example it doesn't follow reading direction convention) and so makes it more difficult (probably not significantly) to determine, by insepection, which capturing group will be at a given result index.

Putting the entire match string being in position 0 also makes sense - mostly for consistency. It allows the entire matched string to remain at the same index regardless of the number capturing groups from regex to regex and regardless of the number of capturing groups that actually match anything (Java for example will collapse the length of the matched groups array for each capturing group does not match any content (think for example something like "a (.*)pattern"). You could always inspect capturing_group_results[capturing_group_results_length - 2], but that doesn't translate well to languages to Perl which dynamically create variables ($1, $2 etc.) (Perl's a bad example of course, since it uses $& for the matched expression, but you get the idea :).

Alan Donnelly