Playing around with regular expressions, especially the balanced matching of the .NET flavor, I came to a point where I realized that I do not understand the inner workings of the engine as well as I thought I did. I'd appreciate any input on why my patterns behave the way they do! But first...

Disclaimer: This question is purely theoretical, and any result obtained here will never be used, or modified and used in production code to parse HTML. Ever. I promise. I do fear the pony. =)

Now to my problem. I'm trying to match the letter A only if it is not preceded by a # anywhere earlier in the string. To demonstrate, I'll always use the string ..A..#..A... Here, only the first A should be matched. Of course, this is quite an easy task using "A(?<!^.*#.*)", but I wish to use conditionals here, since they can be used for balanced matching and other cool things.
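For reference, the intended result can be pinned down without any regex at all. This is a minimal Python sketch, not .NET code; the function name `unhashed_a_positions` is made up for illustration. It only states what "correct" means for the sample string, so any pattern can be checked against it.

```python
def unhashed_a_positions(text):
    """Return the indices of every 'A' that has no '#' anywhere before it."""
    positions = []
    for index, char in enumerate(text):
        if char == "#":
            # Every 'A' after the first '#' is disqualified, so stop scanning.
            break
        if char == "A":
            positions.append(index)
    return positions


print(unhashed_a_positions("..A..#..A.."))  # only the first A qualifies
```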

What I tried is

"A(?<=^(#(?<q>)|[^#])*(?(q)(?!)))"

The way I interpret it is: when the engine encounters an "A", it goes back to the start of the string and, for every character, adds an empty match to the capturing group q if that character is a #. Then it should fail if q contains a match. What I don't understand is why this expression matches both As in my sample string.

When I simply remove the lookbehind and match the whole string, this works:

"^(#(?<q>)|[^#])*(?(q)(?!))A"

It matches the string up to and including the first A, even though the first group's quantifier is greedy. Inserting a '#' at the beginning also causes the match to fail (as desired).
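The whole-string variant can be tried outside .NET as well. The following is a rough port to Python's `re` module, which is an assumption on my part and a different engine: named groups are written `(?P<q>)`, and the lookbehind version cannot be ported because `re` only accepts fixed-width lookbehinds. The helper name `match_prefix` is hypothetical.

```python
import re

# Python spelling of ^(#(?<q>)|[^#])*(?(q)(?!))A
# (?P<q>) records an empty capture whenever a '#' is consumed;
# (?(q)(?!)) then forces a failure if q has captured anything.
PATTERN = re.compile(r"^(#(?P<q>)|[^#])*(?(q)(?!))A")


def match_prefix(text):
    """Return the matched prefix ending in 'A', or None if there is no match."""
    found = PATTERN.match(text)
    return found.group(0) if found else None


print(match_prefix("..A..#..A.."))   # prefix up to the first A
print(match_prefix("#..A..#..A.."))  # no match: a '#' precedes every A
```

The greedy group first swallows the whole string, the conditional rejects that attempt because q is set, and backtracking past the '#' unsets q again, so the match ends at the first A.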

So: how do lookaround groups, named capturing groups inside them, and conditionals interact?

Thanks!

Edit: This problem can be seen more easily in (?<=(?<q>)(?(q)(?!)))., which should not match any character, but matches everything.

+3  A: 

Conditionals aren't really that useful in balanced matching--or anywhere else, for that matter. ;) Balanced matching works by using a named capture group as a stack; every time that group matches something, the matched text is pushed onto the stack. There's also special syntax for popping the stack. Here's a good introduction:

http://blog.stevenlevithan.com/archives/balancing-groups
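The stack idea can be sketched in plain Python to make the analogy concrete. This is only an illustration of the concept, not how the .NET engine is implemented; the function name `is_balanced` and the group name `depth` in the comments are hypothetical.

```python
def is_balanced(text, open_char="(", close_char=")"):
    """Check nesting with an explicit stack, mirroring a .NET balancing group."""
    stack = []
    for char in text:
        if char == open_char:
            # Like (?<depth>\() : each match pushes a capture onto the stack.
            stack.append(char)
        elif char == close_char:
            if not stack:
                # Like (?<-depth>\)) with nothing to pop: this path fails.
                return False
            # Like (?<-depth>\)) : each match pops one capture.
            stack.pop()
    # Like a trailing (?(depth)(?!)) : fail if anything is left on the stack.
    return not stack


print(is_balanced("(a(b)c)"))
print(is_balanced("(()"))
```

In .NET the push is an ordinary named capture, the pop is written `(?<-name>...)`, and the final emptiness check is usually the conditional `(?(name)(?!))`.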

Alan Moore
I actually first ran into this problem when trying to use balanced matching. That technique also seems to fail when used in a lookbehind, and I have no clue why. This question is the simplest case in which the same failure can be seen.
Jens