Is there a canonical ordering of submatch expressions in a regular expression?

For example, what is the order of the submatches in
"(([0-9]{3})\.([0-9]{3})\.([0-9]{3})\.([0-9]{3}))\s+([A-Z]+)" ?

a. (([0-9]{3})\.([0-9]{3})\.([0-9]{3})\.([0-9]{3}))\s+([A-Z]+)  
   (([0-9]{3})\.([0-9]{3})\.([0-9]{3})\.([0-9]{3}))  
   ([A-Z]+)  
   ([0-9]{3})  
   ([0-9]{3})  
   ([0-9]{3})  
   ([0-9]{3})  

b. (([0-9]{3})\.([0-9]{3})\.([0-9]{3})\.([0-9]{3}))\s+([A-Z]+)  
   (([0-9]{3})\.([0-9]{3})\.([0-9]{3})\.([0-9]{3}))  
   ([0-9]{3})  
   ([0-9]{3})  
   ([0-9]{3})  
   ([0-9]{3})  
   ([A-Z]+)

or

c. something else.
+4  A: 

Capturing groups tend to be numbered in the order their opening parentheses appear, left to right. Therefore, option b.
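A quick way to check this is to run the pattern and print each group by index. The sketch below uses Python's `re` module as one example engine; the sample input string is made up purely for illustration, and the dots are escaped as in options a and b.

```python
import re

# Pattern from the question, with the dots escaped as in options a and b.
PATTERN = r"(([0-9]{3})\.([0-9]{3})\.([0-9]{3})\.([0-9]{3}))\s+([A-Z]+)"

# Hypothetical sample input, chosen only to illustrate the numbering.
SAMPLE = "192.168.001.010 HOST"

match = re.match(PATTERN, SAMPLE)

# group(0) is the entire match; groups 1..6 follow the opening parentheses.
groups_in_order = [match.group(i) for i in range(match.re.groups + 1)]
for index, text in enumerate(groups_in_order):
    print(index, text)
```

Here group 1 is the whole dotted address, groups 2 to 5 are its four parts, and group 6 is the trailing word, which matches option b.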

jjrv
+2  A: 

In Perl 5 regular expressions, answer b is correct. Submatch groups are numbered in the order of their opening parentheses.

Many other regular expression engines take their cues from Perl, but you would have to look up individual implementations to be sure. I'd suggest the book Mastering Regular Expressions for a deeper understanding.

Adrian Dunston
A: 

You count opening parentheses, left to right. So the order would be

(([0-9]{3})\.([0-9]{3})\.([0-9]{3})\.([0-9]{3}))
([0-9]{3})
([0-9]{3})
([0-9]{3})
([0-9]{3})
([A-Z]+)

At least this is what Perl would do. Other regex engines might have different rules.
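The counting rule can also be applied to the pattern text itself, without matching anything. The following is only a rough sketch in Python, not how any engine actually does it: the helper name `capture_groups_in_order` is made up for this example, and it assumes a well-formed pattern with no parentheses inside character classes and treats every `(?` construct as non-capturing.

```python
import re


def capture_groups_in_order(pattern):
    """Return each capturing group's source text, ordered by opening parenthesis.

    Simplifying assumptions: the pattern is well formed, no parentheses occur
    inside character classes, and any "(?" construct is non-capturing.
    """
    open_stack = []  # (start index, group number or None for non-capturing)
    found = {}       # group number -> source text of that group
    number = 0
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "(":
            if pattern.startswith("(?", i):
                open_stack.append((i, None))
            else:
                number += 1  # numbered when the parenthesis opens
                open_stack.append((i, number))
        elif ch == ")":
            start, group_number = open_stack.pop()
            if group_number is not None:
                found[group_number] = pattern[start:i + 1]
        i += 1
    return [found[n] for n in sorted(found)]


PATTERN = r"(([0-9]{3})\.([0-9]{3})\.([0-9]{3})\.([0-9]{3}))\s+([A-Z]+)"
ordered = capture_groups_in_order(PATTERN)
for group_number, source in enumerate(ordered, start=1):
    print(group_number, source)
```

For the pattern in the question this lists the outer address group first, then the four three-digit groups, then `([A-Z]+)`, and the count agrees with what Python's `re.compile(PATTERN).groups` reports.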

Asgeir S. Nilsen