views:

138

answers:

4

Hi,

This article show that there is some regexp that is O(2^n) when backtracking. The example is (x+x+)+y. When attempt to match a string like xxxx...p it going to backtrack for a while before figure it out that it couldn't match.

Is there a way to detect such regexp?

thanks

+2  A: 

No I don't think so, but you can use these guidelines:

  • If it contains two quantifiers that are open-ended at the high end and they are nested then it might be O(2^n).
  • If it does not contain two such quantifiers then I think it cannot be O(2^n).

Quantifiers that can cause this are: *, + and {k,}.

Also note that the worst case complexity of evaluating a regular expression might be very different from the complexity on typical strings and that the complexity depends on the specific regular expression engine.

Mark Byers
Yeap but you said "might be O(2^n)" are there a way to be sure? Is there a way like transforming the regexp so that it can be shown to be non exponential?
mathk
A: 

I think if we use regex to examine not very long strings, the backtracking will not be a problem. If we take just a line of text and check it against a pattern, it will not take too much time! Another point is avoiding to have too many "+" and "*" in patterns. Every "+ or *" indicates a loop.

Govan
+6  A: 

If your regexp engine exposes runtime exponential behavior for (x+x+)+y ,then it is broken because a DFA or NFA can recognize this pattern in linear time:

echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | egrep "(x+x+)+y"
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxy" | egrep "(x+x+)+y"

both answer immediately.

In fact, there are only a few cases (like backreferences) where backtracking is really needed (mainly, because a regexp with a backreference is not a regular expression in the language theoretic sense anymore). A capable implementation should switch to backtracking only when these corner cases are given.

In fairness, DFA's have a dark side too, because some regexp's have exponential size requirements, but a size contraints is easier to enforce than a time constraint and the huge DFA runs linear on the input, so it's a better bargain than a small backtracker choking on a couple of X's.

You should really read Russ Cox excellent article series about the implementation of regexp (and the pathological behavior of backtracking): http://swtch.com/~rsc/regexp/

To answer your question about decidability: You can't. Because there is not the one backtracking for regexpr. Every implementation has its own strategies to deal with exponential growth in their algorithm for certain cases and does not cover others. One rule might be fit for here and catastrophic for there.

UPDATE:

For example, one implementation could contain an optimizer which could use algebraic transformations to simplify regexps before executing them: (x+x+)+y is the same a xxx*y, which shouldn't be a problem for any backtracker. But the same optimizer wouldn't recognize the next expression and the problem is there again. Here someone described how to craft a regexpr which fools Perl's optimizer:

http://perlgeek.de/blog-en/perl-tips/in-search-of-an-exponetial-regexp.html

Luther Blissett
+1  A: 

Any regex without backreferences can be matched in linear time, though many regex engines out there in the real world don't do it that way (at least many regex engines that are plugged into programming language runtime environments support backreferences, and don't switch to a more efficient execution model when no backreferences are present).

There's no easy way to find out how much time a regex with backreferences is going to consume.

moritz