ansaurus

Question

Capturing inner items using .net Regex Balanced Matching

Answer 1

+1 A:

It's generally a safe assumption that classes in a library litterally used by millions do not have any major bugs :D

the below regex will match all the above version of <>a

var pattern =  "(" +
                       "((?'Open'<)[a]?)+" +
                       "((?'Close-Open'>)[a]?)+" +
                     ")*" +
                     "(?(Open)(?!))$";

Rune FS 2009-09-01 19:51:53

mind leaving a reason to the down vote?

Rune FS 2009-09-01 21:21:07

Sorry about the confusion, I wasn't clear about what all I wanted my regex example to do. I have edited my original question to specify this. While your regex will match an a following any matched group of angle brackets, it will not do what I had originally intended.

Paul Williams 2009-09-01 21:22:03

Downvoted because your pattern matches too much. It matches <<>>a, for instance.

patjbs 2009-09-01 21:51:19

Answer 2

+4 A:

You need to remember that a lookbehind only scans as far back as it has to in order to satisfy its embedded regex. The expression in your lookbehind is only required to match an angle bracket, so it only looks as far back as the latest one. If it's a left angle bracket, (?<Depth>) pushes an empty string onto the stack represented by that capture group. But if it's a right angle bracket...

It is worth mentioning that if no named group N exists when trying to pop (<-N>) then it will fail... *

In other words it's not the conditional expression -- (?(Depth)a|(?!)) -- that's making your regex fail (as you observed), it's the attempt to "decrement" the "counter". As far as I can tell, your regex is exactly equivalent to

(?<=<[^<>]*)a

So, to answer your question, .NET's balanced-construct matching is not broken. Byzantine yes, but broken, no. :D

Alan Moore 2009-09-01 20:32:51

I wish it were that simple. In fact, the only reason that I did not force the lookbehind to go all the way to the string was that doing so appeared not to effect results. I have changed the regex in my question so as to avoid further confusion.

Paul Williams 2009-09-01 21:24:38

I tried anchoring the lookbehind too, but it still seems to insist on scanning backward. I know it seems like the anchor should force it to jump to the beginning and scan forward, but it doesn't. If the most recent angle bracket happens to be a rightie, it still tries to pop an empty stack and gives up.

Alan Moore 2009-09-01 21:46:52

This is what I've found as well - the lookbehind functions in a right to left manner, which is what is causing this to fail, regardless of anchoring.

patjbs 2009-09-01 21:47:55

Answer 3

+1 A:

Completely revised answer (first two comments were for a previous, incomplete answer):

I have figured out how to accomplish this in a way that I can "replace all" on the results.

string input = @"a<a<<a>>a<a>a>a<a>a";
Regex reg = new Regex(@"
    (?<=
     <
     [^<>]*
     (?(ReverseDepth)(?!))
     (?:
      (?:
       <(?<-ReverseDepth>)
       |
       >(?<ReverseDepth>)
      )
      [^<>]*
     )*
    )
    a
    ", RegexOptions.IgnorePatternWhitespace);
Console.WriteLine(reg.Replace(input, "b"));

This produces the following output:

a<b<<b>>b<b>b>a<b>a

I now realize that my question did not specify this, but I never actually cared to check whether or not the group ever closes fully, since the text I am going to apply this to is pre-validated xml. In order to match my answer to the question, though, and prevent the 'a' in "<a" from matching, the following regex can be used in place of the one I have provided here:

Regex reg = new Regex(@"
    (?<=
     <
     [^<>]*
     (?(ReverseDepth)(?!))
     (?:
      (?:
       <(?<-ReverseDepth>)
       |
       >(?<ReverseDepth>)
      )
      [^<>]*
     )*
    )
    a
    (?=
     (?:
      (?:
       <(?<Depth>)
       |
       >(?<-Depth>)
      )
      [^<>]*
     )*
     (?(Depth)(?!))
     [^<>]*
     >
    )
    ", RegexOptions.IgnorePatternWhitespace);

Paul Williams 2009-09-01 21:45:14

Isn't that what I said?

Alan Moore 2009-09-02 00:11:08

In retrospect, I can see that that is what you said. Your answer assumed that I already knew that the lookbehind was causing the stack operations to happen in the reverse of textual order, but I hadn't figured that out yet.

Paul Williams 2009-09-02 15:41:22

Yeah, I guess I did gloss over that point a bit. So, by matching one angle bracket without trying to change the depth counter, you avoid the problem of trying to pop a nonexistent stack. And now your lookbehind succeeds when it sees more opening than closing brackets, and your lookahead, vice-versa. Nice one!

Alan Moore 2009-09-02 20:40:18

Answer 4

+4 A:

If you want to find every 'a' that's inside a balanced pair of angle brackets, I would suggest this approach:

Regex r = new Regex(@"
    <
      (?>
         [^<>a]+
       |
         (a)
       |
         <(?<N>)
       |
         >(?<-N>)
      )+
    (?(N)(?!))
    >
", RegexOptions.IgnorePatternWhitespace);
string target = @"012a<56a8<0a2<4a6a>>012a<56789a>23456a";
foreach (Match m in r.Matches(target))
{
  Console.WriteLine("{0}, {1}", m.Index, m.Value);
  foreach (Capture c in m.Groups[1].Captures)
  {
    Console.WriteLine("{0}, {1}", c.Index, c.Value);
  }
}

result:

9, <0a2<4a6a>>
11, a
15, a
17, a
24, <56789a>
30, a

Instead of mucking about with the conditional, it goes ahead and matches the whole bracket-delimited (sub)string, in the process capturing any a's it might contain. Unlike your approach, it can pluck any number of bracketed substrings out of a larger string, and any number of a's out of each substring.

Alan Moore 2009-09-02 00:41:01

Thanks, Alan. This regex is the first I have seen which appears to actually capture every desired 'a' and no undesired 'a's.I want to use the regex for replacement, so I will have to do my replacements manually (I can't see a way to use the built-in Replace method.)

Paul Williams 2009-09-02 16:01:13

I found a way to make my regex work with the built-in Replace method (and therefore the text editor I am using as well.)Your answer is still the first correct answer to my question, but I want my final solution to be available to anyone who is browsing, so I have edited my answer below to provide it.

Paul Williams 2009-09-02 17:57:02

why not use the "readable" form of the regex in the actual source code, where it's most beneficial? :) I took the liberty of making it so..

Jeff Atwood 2010-01-11 10:55:55

Probably because I hadn't yet realized that C# verbatim strings could handle embedded line breaks. :) Thanks, Jeff.

Alan Moore 2010-01-11 21:17:35

ansaurus

tags:

views:

answers:

Capturing inner items using .net Regex Balanced Matching

related questions