views:

199

answers:

3

I've seen a few comments here that mention that modern regular expressions go beyond what can be represented in a regular language. How is this so? What features of modern regular expressions are not regular? Examples would be helpful.

+17  A: 

The first thing that comes to mind is backreferences:

(\w*)\s\1

(matches a group of word characters, followed by a space character and then the same group previously matched) eg: hello hello matches, hello world doesn't.

This construct is not regular (ie: can't be generated by a regular grammar).


Another feature supported by Perl Compatible RegExp (PCRE) that is not regular are recursive patterns:

\((a*|(?R))*\)

This can be used to match any combination of balanced parentheses and "a"s (from wikipedia)

NullUserException
Some backreferences can be made in a regular language. For example `(.)x\1` defines a regular language: "axa", "bxb", etc. I believe it's only when combined with Kleene closures that backreferences make the language irregular.
Gabe
You don't need the space in there. `(.*)\1` will do.
Nabb
@Nabb: `.` matches a much larger range of characters than just `\w*\s`
BoltClock
@Nabb It's more to demonstrate real-world usage/for clarity's sake.
NullUserException
+2  A: 

A deterministic or nondeterministic finite automaton recognizes just the regular languages, which are described by regular expressions. The definition of a regular expression is simple. Let S be an alphabet. Then the empty set, the empty string, and every element of S are regular expressions (over S). Let u and v be regular expressions. Then the union (u | v), concatenation (uv), and closure (u*) of u and v are regular expressions over S. This definition is easily extended to the regular languages. No other expression is a regular expression. As pointed out, some back-references are an example. The Wikipedia pages on regular languages and expressions are good references.

In essence, certain "regular expressions" are not regular because no automaton of a particular type can be constructed to recognize them. For example, the the language

{ a^ i b^ i : i <= 0 }

is not regular. This is because the accepting automaton would require infinitely many states, but an automaton accepting regular languages must have a finite number of states.

danportin
Judging from the original question, I'm pretty sure he understands the distinction between regular and non-regular languages. His question is, which features of modern "regular expression" implementations define languages that are not regular, and therefore cannot be expressed in some way using the operations you listed.
Adrian Petrescu
Maybe I should read more closely, then! In any case, I don't think I caused any harm.
danportin
`a^i b^i` is certainly nonregular (it's a DCFG), but can we actually express this using the "regular expressions" of programming languages?
Nabb
+3  A: 

Hey,

A couple of examples:

  • Regular expressions support grouping. E.g. in Ruby: /my (group)/.match("my group")[1] will output "group". storing something in a group requires an external storage, which a finite automaton does not have.
  • Many languages, e.g. C#, support captures, i.e. that each match will be captured on a stack - for example the pattern (?<MYGROUP>.)* could perform multiple captures of "." in the same group.
  • Grouping are used for backreferencing as pointed out by the user NullUserException above. Backreferencing requires one or more external stacks with the power of a push-down-automaton (you have to be able to push something on a stack and peek or pop it afterwards.
  • Some engines have the possibility of seperately pushing and popping external stacks and checking whether the stack is empty. In .NET, actually (?<MYGROUP>test) pushes a stack, while (?<-MYGROUP>) pops a stack.
  • Some engines like the .NET engine have a balanced grouping concept - where an external stack can be both pushed and popped at the same time. Balanced grouping syntax is (?<FIRSTGROUP-LASTGROUP>) which pops the LASTGROUP and pushes the capture since the LASTGROUP index on the FIRSTGROUP stack. This can actually be used to match infinitely nested constructions which is definitely beyond the power of a finite automaton.

Probably other good examples exist :-) If you are further interessted in some of the implementation details of external stacks in combination with Regex's and balanced grouping and thus higher order automata than finite automata, I once wrote two short articles on this (http://www.codeproject.com/KB/recipes/Nested_RegEx_explained.aspx and http://www.codeproject.com/KB/recipes/RegEx_Balanced_Grouping.aspx).

Anyway - finitieness or not - I blieve that the power that this extra stuff brings to the regular languages is great :-)

Br. Morten

Maate
Grouping and capturing are not features that make the language irregular -- all they do is provide metadata, not change the expressiveness of the language. Obviously anything that involves a stack (like backreferences) do make for irregular languages though.
Gabe