How do you back-reference inner parentheses in Regex?

The sample data is a product price list showing different price breaks based on quantity purchased. The format is quantityLow - quantityHigh : pricePer ; multiples.

I used LINQPad to construct this C# regular expression to separate the parts; its Dump() output gives a handy visualization of how the data is separated. In this example, there are "inner" parentheses (nested groups), creating a hierarchical data structure.

string mys = "1-4:2;5-9:1.89";
Regex.Matches (mys, @"((\d+)[-|\+](\d*):(\d+\.?\d*);?)").Dump();  // Graphically show

This breaks down as follows. (The MatchCollection is everything. Within each match, there is the matched text and a group collection. Within the group collection are a few single captures.)

  • MatchCollection (2 items)
    • Group Collection (4 items)
      • CaptureCollection (1 item) () Group "1-4:2;"
      • CaptureCollection (1 item) () Group "1"
      • CaptureCollection (1 item) () Group "4"
      • CaptureCollection (1 item) () Group "2"
    • CaptureCollection (1 item) () Match "1-4:2;"
    • Group Collection (4 items)
      • CaptureCollection (1 item) () Group "5-9:1.89"
      • CaptureCollection (1 item) () Group "5"
      • CaptureCollection (1 item) () Group "9"
      • CaptureCollection (1 item) () Group "1.89"
    • CaptureCollection (1 item) () Match "5-9:1.89"

Just for reference:

  • () parentheses group (capture) the matched results, which can be referenced by \1..\9 (I think).
  • \d matches a single digit. A + after it matches one or more digits. A * after it matches zero or more digits. A ? after it makes the match optional.
  • . matches any single character. \. matches a literal period (the decimal point in this case).
+4  A: 

Just use \1 ... \9 (or $1 ... $9 in some regex implementations) like you normally would. The numbering is from left to right, based on the position of the open paren (so a nested group has a higher number than the group(s) it's nested within).
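
To make the numbering concrete, here is a minimal sketch. It is written in Python rather than the question's C#, on the assumption that group numbering and \N backreferences behave the same way in both; the names price_breaks, nested and same_digits are only for illustration.

```python
import re

# The pattern from the question: group 1 is the outer parenthesis,
# groups 2-4 are nested inside it and numbered by their opening "(".
pattern = re.compile(r"((\d+)[-|\+](\d*):(\d+\.?\d*);?)")

price_breaks = [m.groups() for m in pattern.finditer("1-4:2;5-9:1.89")]
for whole, low, high, price in price_breaks:
    print(whole, low, high, price)

# A backreference reuses an earlier group inside the same pattern:
# \2 must repeat whatever the inner group (\d) captured.
nested = re.compile(r"((\d)-\2)")
same_digits = nested.search("3-4 5-5")
print(same_digits.group(1), same_digits.group(2))
```

The outer group gets number 1 because its open paren comes first, so the inner digit group is \2 even though it closes earlier.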

Laurence Gonsalves
Or `\k<foo>` to backreference a named group `(?<foo>...)`, when there are too many.
Pavel Minaev
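
A small sketch of a named backreference, in Python for illustration. Note the spelling differs: Python writes the group as (?P<name>...) and the backreference as (?P=name), whereas .NET uses (?<name>...) and \k<name> as in the comment above. The group name qty is made up for this example.

```python
import re

# Named group "qty" followed by a backreference to it: the number after
# the hyphen must be identical to the number before it.
repeated = re.compile(r"(?P<qty>\d+)-(?P=qty)")

hit = repeated.search("1-4 7-7")
print(hit.group(0), hit.group("qty"))
```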
Anyone have any sample code to do a named backreference with an autonumber (identity)? Something like (?<name>[1-8]...) which puts name1, name2, name3, name4, etc..?
Dr. Zim
+1  A: 

As a side note, character classes always match a single character, and the "normal" meta characters do not apply in them. So your class [-|\+] matches one of the three characters -, | or +. As you can see, the logical OR meta character does not have a special meaning inside a character class. You also need not escape the + character inside a character class, so this should do it: [-+].
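
A quick way to check this, sketched in Python (the same class syntax is assumed to apply in .NET); the variable names are just for illustration:

```python
import re

original = re.compile(r"[-|\+]")   # class from the question
simplified = re.compile(r"[-+]")   # suggested replacement

# For each character: (matched by original, matched by simplified).
# Inside a class, "|" is a literal pipe, not alternation.
results = {
    ch: (bool(original.fullmatch(ch)), bool(simplified.fullmatch(ch)))
    for ch in "-+|"
}
print(results)
```

Only the original class accepts the pipe, which is probably not what was intended for a quantity range.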

Bart Kiers
After researching this, I agree that the pipe does not "or", but wouldn't you still need to "quote" the minus and the plus within the class brackets? For example: /^[\d\s\(\)\-\+\/]*$/ would match a phone number of 714/921-5424 (Examples from the VisiBone charts), or is this implementation dependent?
Dr. Zim
Oddly, both ways seem to work fine. I picked up "Regulator" which at least shows how regex are broken down. If it has a feature to set the implementation, I think I am in business.
Dr. Zim
Note that the class `/^[\d\s\(\)\-\+\/]*$/` is equivalent to `/^[\d\s()+\/-]*$/`
Bart Kiers
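
The equivalence can be spot-checked with a short Python sketch. The surrounding / delimiters are dropped because Python takes the pattern as a plain string; the sample strings other than 714/921-5424 are invented for the test.

```python
import re

escaped = re.compile(r"^[\d\s\(\)\-\+\/]*$")   # everything escaped
minimal = re.compile(r"^[\d\s()+\/-]*$")       # hyphen moved to the end

samples = ["714/921-5424", "(714) 921-5424", "+1 714 921 5424",
           "714.921.5424", "call me"]

# Record what each pattern says about each sample.
verdicts = [(s, bool(escaped.match(s)), bool(minimal.match(s))) for s in samples]
for s, a, b in verdicts:
    print(s, a, b)
```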
+1  A: 

Note that this is in reply to Dr. Zim's comment:

"Oddly, both ways seem to work fine. I picked up "Regulator" which at least shows how regex are broken down. If it has a feature to set the implementation, I think I am in business."

but my answer was too long for the comment box.

No, you don't need to escape the plus or, in this case, the hyphen. Apart from the backslash itself, the following characters have a special meaning inside a character class: ], ^ and -. These three characters are the only characters that might need escaping (note that the [ needs no escaping!). I say might because it depends on where these meta characters occur. The ^ only has a special meaning (as a negation indicator) when placed at the start of a character class; elsewhere, it needs no escaping and will match just the literal ^. Some examples to illustrate:

[^a]   // special meaning: matches any character except 'a'
[a^]   // matches 'a' or '^'
[\^a]  // matches '^' or 'a'

And the hyphen only has a special meaning (as a range indicator) when it is placed neither at the start nor at the end of a character class. Examples:

[a-c]  // special meaning: matches 'a', 'b' or 'c'
[ac-]  // matches 'a', 'c' or '-'
[-ac]  // matches '-', 'a' or 'c'
[a\-c] // matches 'a', '-' or 'c'
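
The caret and hyphen examples can be verified with a small Python sketch; the helper accepted and the candidate string are assumptions made up for this check, not part of any library.

```python
import re

def accepted(char_class, candidates="abc^-"):
    """Return the candidate characters that the class matches."""
    return "".join(ch for ch in candidates if re.fullmatch(char_class, ch))

classes = [r"[^a]", r"[a^]", r"[\^a]", r"[a-c]", r"[ac-]", r"[-ac]", r"[a\-c]"]
table = {cls: accepted(cls) for cls in classes}
for cls, chars in table.items():
    print(cls, "->", chars)
```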

No doubt some regex implementations differ from what I just posted, but the majority of languages comply with these rules (all the languages I have worked with, at least!). And as you noticed, it is safe to over-escape these punctuation characters inside character classes: it does no harm. Both the classes [+] and [\+] will match the literal +. IMHO, the first is preferred because I find a regex with too many escapes hard to read. But some will disagree with me and find that using an escape (while not necessary) makes it extra clear that the literal + is being matched instead of the greedy quantifier.

Hope that clears things up.

Bart Kiers