A while ago, I saw that in regex (at least in PHP) you can make a capturing group not capture by prepending ?: to it.

Example

$str = 'big blue ball';
$regex = '/b(ig|all)/';
preg_match_all($regex, $str, $matches);
var_dump($matches);

Outputs...

array(2) {
  [0]=>
  array(2) {
    [0]=>
    string(3) "big"
    [1]=>
    string(4) "ball"
  }
  [1]=>
  array(2) {
    [0]=>
    string(2) "ig"
    [1]=>
    string(3) "all"
  }
}

In this example, I don't care about what was matched inside the parentheses, so I prepended ?: to the group ('/b(?:ig|all)/') and got this output:

array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(3) "big"
    [1]=>
    string(4) "ball"
  }
}

This is very useful - at least I think so. Sometimes you just don't want to clutter your matches with unnecessary values.

I was trying to look up the documentation and the official name for this (I call it a non-capturing group, though I think I've heard that term before).

Because the syntax is all symbols, it is hard to Google for.

I have also looked at a number of regex reference guides, but found no mention of it.

Being prefixed with ? and appearing as the first characters inside the parentheses would lead me to believe it has something to do with lookaheads or lookbehinds.

So, what is the proper name for these, and where can I learn more?

Thanks

Update

Plenty of answers, thank you! I'll accept an answer in the morning.

+2  A: 

(?:...) as a whole denotes a non-capturing group.

Regular-expressions.info describes this syntax:

The question mark and the colon after the opening round bracket are the special syntax that you can use to tell the regex engine that this pair of brackets should not create a backreference. Note the question mark [...] is the regex operator that makes the previous token optional. This operator cannot appear after an opening round bracket, because an opening bracket by itself is not a valid regex token. Therefore, there is no confusion between the question mark as an operator to make a token optional, and the question mark as a character to change the properties of a pair of round brackets. The colon indicates that the change we want to make is to turn off capturing the backreference.

madgnome
+3  A: 

It's available on the Subpatterns page of the official documentation.

The fact that plain parentheses fulfill two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by "?:", the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string "the white queen" is matched against the pattern the ((?:red|white) (king|queen)) the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of captured substrings is 99, and the maximum number of all subpatterns, both capturing and non-capturing, is 200.
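To see the numbering described in that excerpt, here is a minimal sketch that runs the manual's own pattern and subject through preg_match. The variable names are mine, not from the manual.

```php
<?php
// Pattern and subject come from the manual excerpt quoted above.
$pattern = '/the ((?:red|white) (king|queen))/';
$subject = 'the white queen';

preg_match($pattern, $subject, $m);

// $m[0] is the whole match. The (?:red|white) group gets no number,
// so the outer group is 1 and (king|queen) is 2.
print_r($m);
```

The array should hold three entries: the full match, "white queen" and "queen", with no slot for the non-capturing alternation.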

It's also worth noting that you can set options for the subpattern with this syntax. For example, if you want only the subpattern to be case-insensitive, you can write:

(?i:foo)bar

Will match:

  • foobar
  • Foobar
  • FoObar
  • ...etc

But not

  • fooBar
  • FooBAR
  • ...etc
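A quick way to check those two lists is to run them through preg_match. This is only a sketch: the ^ and $ anchors are my addition so that each subject has to match in full, and the variable names are made up for the example.

```php
<?php
// (?i:foo) is case-insensitive for "foo" only; "bar" stays case-sensitive.
// The ^ and $ anchors are an assumption added for this test.
$pattern  = '/^(?i:foo)bar$/';
$subjects = ['foobar', 'Foobar', 'FoObar', 'fooBar', 'FooBAR'];

$results = [];
foreach ($subjects as $subject) {
    // preg_match returns 1 on a match and 0 on no match.
    $results[$subject] = preg_match($pattern, $subject) === 1;
}

var_dump($results);
```

The first three subjects should come back true and the last two false, because the uppercase letters in "Bar" and "BAR" fall outside the (?i:...) group.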

Oh, and while the official documentation doesn't actually explicitly name the syntax, it does refer to it later on as a "non-capturing subpattern" (which makes complete sense, and is what I would call it anyway, since it's not really a "group", but a subpattern)...

ircmaxell
+2  A: 

Here's what I've found:

If you do not use the backreference, you can optimize this regular expression into Set(?:Value)?. The question mark and the colon after the opening round bracket are the special syntax that you can use to tell the regex engine that this pair of brackets should not create a backreference. Note the question mark after the opening bracket is unrelated to the question mark at the end of the regex. That question mark is the regex operator that makes the previous token optional. This operator cannot appear after an opening round bracket, because an opening bracket by itself is not a valid regex token. Therefore, there is no confusion between the question mark as an operator to make a token optional, and the question mark as a character to change the properties of a pair of round brackets. The colon indicates that the change we want to make is to turn off capturing the backreference.

http://www.regular-expressions.info/brackets.html
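For what it's worth, here is a small sketch of the Set(?:Value)? pattern from that quote next to its capturing form. The subject strings and variable names are my own choice for illustration.

```php
<?php
$capturing    = '/Set(Value)?/';
$nonCapturing = '/Set(?:Value)?/';

// Both patterns match the same text, but only the first stores the group.
preg_match($capturing, 'SetValue', $withGroup);
preg_match($nonCapturing, 'SetValue', $withoutGroup);

// The trailing ? still makes the whole group optional.
$matchesBareSet = preg_match($nonCapturing, 'Set') === 1;

print_r($withGroup);
print_r($withoutGroup);
var_dump($matchesBareSet);
```

With the capturing version you get "Value" as element 1; with the non-capturing version the result contains only the full match.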

Ruel
+1  A: 

A Google search for "non-capturing group" should turn up the info you seek.

Ty W
+1  A: 

It's in the PHP manual, and, I believe, in any near-complete regular expression reference for any language…

The fact that plain parentheses fulfill two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by "?:", the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns.

Source

poke
A: 

I don't know how to do this with ?:, but it is easy with a simple loop:

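$regex = '/b(ig|all)/';
// Hard-coded copy of the $matches array from the question.
$array = array(
    0 => array(0 => 'big', 1 => 'ball'),
    1 => array(0 => 'ig', 1 => 'all')
);
// Drop every row that contains a value the full pattern does not match.
foreach ($array as $key => $row) {
    foreach ($row as $val) {
        if (!preg_match($regex, $val)) {
            unset($array[$key]);
        }
    }
}
print_r($array);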
Alexander.Plutov
I don't think you read the question correctly.
alex
I know. I was only proposing an example.
Alexander.Plutov
+1  A: 

PHP's preg_match_all uses the PCRE (Perl-Compatible Regular Expression) syntax, which is documented here. Non-capturing subpatterns are documented in the Subpatterns chapter.

would lead me to believe it has something to do with lookaheads or lookbehinds.

Nope, there are lots of different features which are triggered by open-bracket-question-mark. Lookahead/lookbehind is just the first one you met.

It's messy that so many options have to be squeezed into (? instead of each getting a more readable syntax of its own, but this was necessary: every new feature had to fit into a sequence that was not a valid expression in itself in older variants of regex.
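To give an idea of the variety, here is a rough sketch of a few other PCRE constructs that start with (?. The patterns and subject strings are made up for illustration, and this is far from a complete list.

```php
<?php
// (?=...) lookahead: "foo" matches only when "bar" follows, and "bar" is not consumed.
preg_match('/foo(?=bar)/', 'foobar', $lookahead);

// (?<name>...) named group: the capture is stored under 'year' as well as index 1.
preg_match('/(?<year>\d{4})-\d{2}/', '2010-11', $named);

// (?i) inline option: turns on case-insensitive matching from this point on.
preg_match('/(?i)big/', 'BIG', $inlineFlag);

// (?#...) comment: ignored by the engine.
preg_match('/b(?#just a comment)ig/', 'big', $comment);

print_r([$lookahead[0], $named['year'], $inlineFlag[0], $comment[0]]);
```

Each of these starts with the same two characters as (?:...), yet only the lookahead has anything to do with looking around.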

bobince