ansaurus

Question

How are nested capturing groups numbered in regular expressions?

Answer 1

+2 A:

The order of capturing in the order of the left paren is standard across all the platforms I've worked in.

Devin Ceartas 2009-08-21 19:57:31

"capturing in the order of the left paren" Thanks for that, it's a much more succinct way of describing the behavior.

Alan Storm 2009-08-21 19:58:40

Which platforms have you worked on?

Brad Gilbert 2009-08-21 20:03:39

You can re-number the captures in Perl 5.10, and Perl 6.

Brad Gilbert 2009-08-24 20:32:05

Answer 2

+8 A:

From perlrequick

If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc.

Update

I don't use PCRE much, as I generally use the real thing ;), but PCRE's docs show the same as Perl's:

SUBPATTERNS

2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain number for the capturing subpatterns.

For example, if the string "the red king" is matched against the pattern
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.

If PCRE is drifting away from Perl regex compatibility, perhaps the acronym should be redefined--"Perl Cognate Regular Expressions", "Perl Comparable Regular Expressions" or something. Or just divest the letters of meaning.

daotoad 2009-08-21 20:00:17

+1 but note that he is not using Perl.

Sinan Ünür 2009-08-21 20:21:04

@Sinan : he is using PCRE in PHP, which is "Perl-Compatible Regular Expressions" ; so it should be quite the same as using Perl directly

Pascal MARTIN 2009-08-21 20:37:15

Pascal, PCRE started as an attempt to be a Perl Compatible Regular Expression set, but in recent years the two have diverged slightly. Still very similar, but there are subtle differences in teh advanced feature sets. (Also, per the question, I'm interested in all platforms)

Alan Storm 2009-08-21 20:55:24

He can pick any RE engine out there, it should all be the same.

sixlettervariables 2009-08-21 21:01:19

Actually, it's Perl that's doing most of the "drifting away" these days, but you're right: "Perl-compatible" is quickly changing from a misnomer to a non sequitur. :D

Alan Moore 2009-08-22 14:44:25

@Alan, Perl is definitely on the move. P5.10 changed a few things, but 6 will be vastly different. The P will almost certainly need to be interpreted as "Perl 5". PCRE is a great project, that I can't praise enough, it has been a godsend on more than a few projects.

daotoad 2009-08-24 03:47:06

Answer 3

+2 A:

Every regex flavor I know numbers groups by the order in which the opening parentheses appear. That outer groups are numbered before their contained sub-groups is just a natural outcome, not explicit policy.

Where it gets interesting is with named groups. In most cases, they follow the same policy of numbering by the relative positions of the parens--the name is merely an alias for the number. However, in .NET regexes the named groups are numbered separately from numbered groups. For example:

Regex.Replace(@"one two three four", 
              @"(?<one>\w+) (\w+) (?<three>\w+) (\w+)",
              @"$1 $2 $3 $4")

// result: "two four one three"

In effect, the number is an alias for the name; the numbers assigned to named groups start where the "real" numbered groups leave off. That may seem like a bizarre policy, but there's a good reason for it: in .NET regexes you can use the same group name more than once in a regex. That makes possible regexes like the one from this thread for matching floating-point numbers from different locales:

^[+-]?[0-9]{1,3}
(?:
    (?:(?<thousand>\,)[0-9]{3})*
    (?:(?<decimal>\.)[0-9]{2})?
|
    (?:(?<thousand>\.)[0-9]{3})*
    (?:(?<decimal>\,)[0-9]{2})?
|
    [0-9]*
    (?:(?<decimal>[\.\,])[0-9]{2})?
)$

If there's a thousands separator, it will be saved in group "thousand" no matter which part of the regex matched it. Similarly, the decimal separator (if there is one) will always be saved in group "decimal". Of course, there are ways to identify and extract the separators without reusable named groups, but this way is so much more convenient, I think it more than justifies the weird numbering scheme.

And then there's Perl 5.10+, which gives us more control over capturing groups than I know what to do with. :D

Alan Moore 2009-08-21 21:43:47

Answer 4

A:

Yeah, this is all pretty much well defined for all the languages you're interested in:

Java - http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html#cg
"Capturing groups are numbered by counting their opening parentheses from left to right. ... Group zero always stands for the entire expression."
.Net - http://msdn.microsoft.com/en-us/library/bs2twtah%28VS.71%29.aspx
"Captures using () are numbered automatically based on the order of the opening parenthesis, starting from one. The first capture, capture element number zero, is the text matched by the whole regular expression pattern.")
PHP (PCRE functions) - http://www.php.net/manual/en/function.preg-replace.php#function.preg-replace.parameters
"\0 or $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern." (It was also true of the deprecated POSIX functions)

PCRE - http://www.pcre.org/pcre.txt
To add to what Alan M said, search for "How pcre_exec() returns captured substrings" and read the fifth paragraph that follows:

The  first  pair  of  integers, ovector[0] and ovector[1], identify the
portion of the subject string matched by the entire pattern.  The next
pair  is  used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
has  been  set.  For example, if two substrings have been captured, the
returned value is 3. If there are no capturing subpatterns, the  return
value from a successful match is 1, indicating that just the first pair
of offsets has been set.

Perl's different - http://perldoc.perl.org/perlre.html#Capture-buffers
$1, $2 etc. match capturing groups as you'd expect (i.e. by occurrance of opening bracket), however $0 returns the program name, not the entire query string - to get that you use $& instead.

You'll more than likely find similar results for other languages (Python, Ruby and others).

You say that it's equally logical to list the inner capture groups first and you're right - it's just be a matter of indexing on closing, rather than opening, parens. (if I understand you correctly). Doing this is less natural though (for example it doesn't follow reading direction convention) and so makes it more difficult (probably not significantly) to determine, by insepection, which capturing group will be at a given result index.

Putting the entire match string being in position 0 also makes sense - mostly for consistency. It allows the entire matched string to remain at the same index regardless of the number capturing groups from regex to regex and regardless of the number of capturing groups that actually match anything (Java for example will collapse the length of the matched groups array for each capturing group does not match any content (think for example something like "a (.*)pattern"). You could always inspect capturing_group_results[capturing_group_results_length - 2], but that doesn't translate well to languages to Perl which dynamically create variables ($1, $2 etc.) (Perl's a bad example of course, since it uses $& for the matched expression, but you get the idea :).

Alan Donnelly 2009-08-22 03:31:10

ansaurus

tags:

views:

answers:

How are nested capturing groups numbered in regular expressions?

related questions