This is the second part of a series of educational regex articles. It shows how lookaheads and nested references can be used to match the non-regular languge anbn. Nested references are first introduced in: How does this regex find triangular numbers?
One of the archetypal non-regular languages is:
L = { a
nb
n: n > 0 }
This is the language of all non-empty strings consisting of some number of a
's followed by an equal number of b
's. Examples of strings in this language are ab
, aabb
, aaabbb
.
This language can be show to be non-regular by the pumping lemma. It is in fact an archetypal context-free language, which can be generated by the context-free grammar S → aSb | ab
.
Nonetheless, modern day regex implementations clearly recognize more than just regular languages. That is, they are not "regular" by formal language theory definition. PCRE and Perl supports recursive regex, and .NET supports balancing groups definition. Even less "fancy" features, e.g. backreference matching, means that regex is not regular.
But just how powerful is this "basic" features? Can we recognize L
with Java regex, for example? Can we perhaps combine lookarounds and nested references and have a pattern that works with e.g. String.matches
to match strings like ab
, aabb
, aaabbb
, etc?
References
- perlfaq6: Can I use Perl regular expressions to match balanced text?
- MSDN - Regular Expression Language Elements - Balancing Group Definitions
- pcre.org - PCRE man page
- regular-expressions.info - Lookarounds and Grouping and Backreferences
java.util.regex.Pattern