views:

144

answers:

3

Classical regular expressions are equivalent to finite automata. Most current implementations of "regular expressions" are not strictly speaking regular expressions but are more powerful. Some people have started using the term "pattern" rather than "regular expression" to be more accurate.

What is the formal language classification of what can be described with a modern "regular expression" such as the patterns supported in Perl 5?

Update: By "Perl 5" I mean that pattern matching functionality implemented in Perl 5 and adopted by numerous other languages (C#, JavaScript, etc) and not anything specific to Perl. I don't want to consider, for example, tricks for embedding Perl code in a pattern.

+3  A: 

I've always heard perl's regex implementation described as an NFA with backtracking. Wikipedia seems to have a little section on this:

This is possibly slightly too fuzzy but it's informative non the less:

From Wikipedia:

There are at least three different algorithms that decide if and how a given regular expression matches a string.

The oldest and fastest two rely on a result in formal language theory that allows every nondeterministic finite state machine (NFA) to be transformed into a deterministic finite state machine (DFA). The DFA can be constructed explicitly and then run on the resulting input string one symbol at a time. Constructing the DFA for a regular expression of size m has the time and memory cost of O(2m), but it can be run on a string of size n in time O(n). An alternative approach is to simulate the NFA directly, essentially building each DFA state on demand and then discarding it at the next step, possibly with caching. This keeps the DFA implicit and avoids the exponential construction cost, but running cost rises to O(nm). The explicit approach is called the DFA algorithm and the implicit approach the NFA algorithm. As both can be seen as different ways of executing the same DFA, they are also often called the DFA algorithm without making a distinction. These algorithms are fast, but using them for recalling grouped subexpressions, lazy quantification, and similar features is tricky.[12][13]

The third algorithm is to match the pattern against the input string by backtracking. This algorithm is commonly called NFA, but this terminology can be confusing. Its running time can be exponential, which simple implementations exhibit when matching against expressions like (a|aa)*b that contain both alternation and unbounded quantification and force the algorithm to consider an exponentially increasing number of sub-cases. More complex implementations will often identify and speed up or abort common cases where they would otherwise run slowly.

Although backtracking implementations only give an exponential guarantee in the worst case, they provide much greater flexibility and expressive power. For example, any implementation which allows the use of backreferences, or implements the various extensions introduced by Perl, must use a backtracking implementation.

Some implementations try to provide the best of both algorithms by first running a fast DFA match to see if the string matches the regular expression at all, and only in that case perform a potentially slower backtracking match.

Benj
+4  A: 

Perl regexps, as those of any pattern language, where "backreferences" are allowed, are not actually "regular".

Backreferences is the mechanism of matching the same string that was matched by a sub-pattern before. For example, /^(a*)\1$/ matches only strings with even number of as, because after some as there should follow the same number of those.

It's easy to prove, that, for instance, pattern /^((a|b)*)\1$/ matches words from a non-regular language(*), so it's more expressive that ant finite automaton. Regular expressions can't "remember" a string of arbitrary length and then match it again (the length may be very long, while finite-state machine only can simulate finite amount of "memory").

A formal proof would use the pumping lemma. (By the way, this language can't be described by context-free grammar as well.)

Let alone the tricks that allow to use perl code in perl regexps (non-regular language of balanced parentheses there).


(*) "Regular languages" are sets of words that are matched by finite automata. I already wrote an answer about that.

Pavel Shved
+2  A: 

There was a recent discussion on this topic a Perlmonks: Turing completeness and regular expressions

daotoad
Rumor has it some of the more bizarre features were slipped in by Ilya to make Perl patterns Turing complete so he could write a chess program in a regex. (Wish I could find attribution on that)
Schwern