views:

261

answers:

3

This is the second part of a series of educational regex articles. It shows how lookaheads and nested references can be used to match the non-regular languge anbn. Nested references are first introduced in: How does this regex find triangular numbers?

One of the archetypal non-regular languages is:

L = { anbn: n > 0 }

This is the language of all non-empty strings consisting of some number of a's followed by an equal number of b's. Examples of strings in this language are ab, aabb, aaabbb.

This language can be show to be non-regular by the pumping lemma. It is in fact an archetypal context-free language, which can be generated by the context-free grammar S → aSb | ab.

Nonetheless, modern day regex implementations clearly recognize more than just regular languages. That is, they are not "regular" by formal language theory definition. PCRE and Perl supports recursive regex, and .NET supports balancing groups definition. Even less "fancy" features, e.g. backreference matching, means that regex is not regular.

But just how powerful is this "basic" features? Can we recognize L with Java regex, for example? Can we perhaps combine lookarounds and nested references and have a pattern that works with e.g. String.matches to match strings like ab, aabb, aaabbb, etc?

References

Linked questions

+15  A: 
polygenelubricants
No doubt there may be mistakes/typos in this long answer, so please leave feedbacks as comments so I can revise them on my own.
polygenelubricants
@poly - Great job. It'll take me a while to read over, but the very last line is basically impossible to read; it is such small font. ------ Oh wait. Is that a `feature`?.... Not sure if it's a good idea. I know what the last symbol is, but it can't be read (aside from copy pasting it).
Peter Ajtai
@Peter: Highlight the small text and copy and paste into something else. It's made difficult to read on purpose: it's a spoiler, the solution to the bonus puzzle.
polygenelubricants
+1: Fantastic explanation, these "Advanced articles" are brilliant ideas.
Callum Rogers
@poly - great article. You asked for feedback... you mention java regexes in the title, but then use PHP. I'm confused by that. Does PHP's preg_match() use java regexes?
LarsH
+6  A: 

Given that no mention has been made of PCRE supporting recursive patterns, I'd just like to point out the simplest and most efficient example of PCRE that describes the language in question:

/^(a(?1)?b)$/

+1 wow, I didn't know PCRE supports recursive pattern (I'm still learning! Every day!). I've revised the article to accommodate this information. I don't think recursive pattern can match `a^n b^n c^n`, though.
polygenelubricants
It should be noted this option is simpler, but not as good as the posted answer - the recursion overflows on long strings.
Kobi
As a^n b^n c^n is not context-free, we cannot match it with as simple an expression as the original. However, it is possible by incorporating a zero-width assertion, (?= ) in this case: /^(a(?1)c|(?=(b(?2)?c))b+)$/
@user410484 wow that makes my head hurt. Somebody want to unpack and explain that last regexp, `/^(a(?1)c|(?=(b(?2)?c))b+)$/`?
LarsH
+5  A: 
KennyTM
@poly: Thanks :). Actually I am not familiar with .NET patterns, but for this kind of patterns it turns out to be very easy with balancing groups, so I supplement this answer.
KennyTM
@Kenny: can you do this with recursive pattern? Because if you can't, that's an interesting twist that balancing group can do things that recursive pattern can't. (And yes, I very much appreciate the supplement).
polygenelubricants
@Kenny: by the way, the reason why I omitted .NET solution was because I do have plans for "How can we match `a^n b^n` with .NET regex?" article in the future, but you're more than welcome to write it if you want. I'm not doing these articles just for myself; I do want to encourage others to do it as well to have good content on the site.
polygenelubricants
Please update if you figure out a way to do it with recursive patterns. I played around with balancing groups to capture words whose lengths make a Fibonacci series, and couldn't get it to work. It may be possible using look-around, similar to what I've done.
Kobi
LarsH