tags:

views:

312

answers:

4

I have found the following resources on Balanced Matching for .net Regexes:

From what I have read in these, the following example should work:

This regex should find an "a" anywhere within an angle-bracket group, no matter how deep. It should match "<a>", "<<a>>", "<a<>>", "<<>a>", "<<><a>>", etc.

(?<=
    ^
    (
     (
      <(?<Depth>)
      |
      >(?<-Depth>)
     )
     [^<>]*?
    )+?
)
(?(Depth)a|(?!))

matching on the "a" in the string "<<>a>"

While it will work for strings "<a<>>" and "<<a>>", I can't get it to match an "a" that is following a ">".

According to the explanations I have read, the first two "<"s should increment Depth twice, then the first ">" should decrement it once. At this point, (?(Depth)a|(?!)) should perform the "yes" option, but the regex never even makes it here.

Consider the following regex, which makes no such check and still fails to match the string in question:

(?<=
    ^
    (
     (
      <(?<Depth>)
      |
      >(?<-Depth>)
     )
     [^<>]*?
    )+?
)
a

Am I missing something, or is the regex engine working incorrectly?

+1  A: 

It's generally a safe assumption that classes in a library litterally used by millions do not have any major bugs :D

the below regex will match all the above version of <>a

var pattern =  "(" +
                       "((?'Open'<)[a]?)+" +
                       "((?'Close-Open'>)[a]?)+" +
                     ")*" +
                     "(?(Open)(?!))$";
Rune FS
mind leaving a reason to the down vote?
Rune FS
Sorry about the confusion, I wasn't clear about what all I wanted my regex example to do. I have edited my original question to specify this. While your regex will match an a following any matched group of angle brackets, it will not do what I had originally intended.
Paul Williams
Downvoted because your pattern matches too much. It matches <<>>a, for instance.
patjbs
+4  A: 

You need to remember that a lookbehind only scans as far back as it has to in order to satisfy its embedded regex. The expression in your lookbehind is only required to match an angle bracket, so it only looks as far back as the latest one. If it's a left angle bracket, (?<Depth>) pushes an empty string onto the stack represented by that capture group. But if it's a right angle bracket...

It is worth mentioning that if no named group N exists when trying to pop (<-N>) then it will fail... *

In other words it's not the conditional expression -- (?(Depth)a|(?!)) -- that's making your regex fail (as you observed), it's the attempt to "decrement" the "counter". As far as I can tell, your regex is exactly equivalent to

(?<=<[^<>]*)a

So, to answer your question, .NET's balanced-construct matching is not broken. Byzantine yes, but broken, no. :D

Alan Moore
I wish it were that simple. In fact, the only reason that I did not force the lookbehind to go all the way to the string was that doing so appeared not to effect results. I have changed the regex in my question so as to avoid further confusion.
Paul Williams
I tried anchoring the lookbehind too, but it still seems to insist on scanning backward. I know it seems like the anchor should force it to jump to the beginning and scan forward, but it doesn't. If the most recent angle bracket happens to be a rightie, it still tries to pop an empty stack and gives up.
Alan Moore
This is what I've found as well - the lookbehind functions in a right to left manner, which is what is causing this to fail, regardless of anchoring.
patjbs
+1  A: 

Completely revised answer (first two comments were for a previous, incomplete answer):

I have figured out how to accomplish this in a way that I can "replace all" on the results.

string input = @"a<a<<a>>a<a>a>a<a>a";
Regex reg = new Regex(@"
    (?<=
     <
     [^<>]*
     (?(ReverseDepth)(?!))
     (?:
      (?:
       <(?<-ReverseDepth>)
       |
       >(?<ReverseDepth>)
      )
      [^<>]*
     )*
    )
    a
    ", RegexOptions.IgnorePatternWhitespace);
Console.WriteLine(reg.Replace(input, "b"));

This produces the following output:

a<b<<b>>b<b>b>a<b>a

I now realize that my question did not specify this, but I never actually cared to check whether or not the group ever closes fully, since the text I am going to apply this to is pre-validated xml. In order to match my answer to the question, though, and prevent the 'a' in "<a" from matching, the following regex can be used in place of the one I have provided here:

Regex reg = new Regex(@"
    (?<=
     <
     [^<>]*
     (?(ReverseDepth)(?!))
     (?:
      (?:
       <(?<-ReverseDepth>)
       |
       >(?<ReverseDepth>)
      )
      [^<>]*
     )*
    )
    a
    (?=
     (?:
      (?:
       <(?<Depth>)
       |
       >(?<-Depth>)
      )
      [^<>]*
     )*
     (?(Depth)(?!))
     [^<>]*
     >
    )
    ", RegexOptions.IgnorePatternWhitespace);
Paul Williams
Isn't that what I said?
Alan Moore
In retrospect, I can see that that is what you said. Your answer assumed that I already knew that the lookbehind was causing the stack operations to happen in the reverse of textual order, but I hadn't figured that out yet.
Paul Williams
Yeah, I guess I did gloss over that point a bit. So, by matching one angle bracket without trying to change the depth counter, you avoid the problem of trying to pop a nonexistent stack. And now your lookbehind succeeds when it sees more opening than closing brackets, and your lookahead, vice-versa. Nice one!
Alan Moore
+4  A: 

If you want to find every 'a' that's inside a balanced pair of angle brackets, I would suggest this approach:

Regex r = new Regex(@"
    <
      (?>
         [^<>a]+
       |
         (a)
       |
         <(?<N>)
       |
         >(?<-N>)
      )+
    (?(N)(?!))
    >
", RegexOptions.IgnorePatternWhitespace);
string target = @"012a<56a8<0a2<4a6a>>012a<56789a>23456a";
foreach (Match m in r.Matches(target))
{
  Console.WriteLine("{0}, {1}", m.Index, m.Value);
  foreach (Capture c in m.Groups[1].Captures)
  {
    Console.WriteLine("{0}, {1}", c.Index, c.Value);
  }
}

result:

9, <0a2<4a6a>>
11, a
15, a
17, a
24, <56789a>
30, a

Instead of mucking about with the conditional, it goes ahead and matches the whole bracket-delimited (sub)string, in the process capturing any a's it might contain. Unlike your approach, it can pluck any number of bracketed substrings out of a larger string, and any number of a's out of each substring.

Alan Moore
Thanks, Alan. This regex is the first I have seen which appears to actually capture every desired 'a' and no undesired 'a's.I want to use the regex for replacement, so I will have to do my replacements manually (I can't see a way to use the built-in Replace method.)
Paul Williams
I found a way to make my regex work with the built-in Replace method (and therefore the text editor I am using as well.)Your answer is still the first correct answer to my question, but I want my final solution to be available to anyone who is browsing, so I have edited my answer below to provide it.
Paul Williams
why not use the "readable" form of the regex in the actual source code, where it's most beneficial? :) I took the liberty of making it so..
Jeff Atwood
Probably because I hadn't yet realized that C# verbatim strings could handle embedded line breaks. :) Thanks, Jeff.
Alan Moore