I couldn't find the answer to this anywhere: what is the runtime complexity of a regex match and substitution?

Edit: I work in Python, but I'd also like to know about other popular languages and tools in general (Java, Perl, sed).

+1  A: 

Depends on the implementation. What language/library/class? There may be a best case, but it would be very specific to the feature set of the implementation.

Chris
+9  A: 

From a purely theoretical stance:

The implementation I am familiar with would be to build a deterministic finite automaton (DFA) to recognize the regex. This is done in O(2^m), m being the size of the regex, using the standard algorithm (Thompson's construction to an NFA, then the subset construction to a DFA). Once this is built, running a string through it is linear in the length of the string: O(n), n being the string length. A replacement on a match found in the string should be constant time.

So overall, I suppose O(2^m + n).
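A minimal sketch of the second phase in Python, with the transition table written out by hand for the toy regex ab*c; producing such a table from an arbitrary pattern is where the O(2^m) cost lives:

    # DFA for the toy regex ab*c, written out by hand.
    # Matching is one dictionary lookup per input character, i.e. O(n).
    dfa = {
        (0, 'a'): 1,   # start state: must see an 'a' first
        (1, 'b'): 1,   # any number of 'b's
        (1, 'c'): 2,   # a final 'c' accepts
    }
    ACCEPTING = {2}

    def dfa_match(text):
        state = 0
        for ch in text:
            state = dfa.get((state, ch))
            if state is None:      # dead end: no match is possible
                return False
        return state in ACCEPTING

    print(dfa_match("abbbc"))  # True
    print(dfa_match("abca"))   # False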

theprise
Is the complexity still linear when considering backtracking as needed by some quantifiers?
Joey
I am interested in a proof for "This is done in O(2^m), m being the size of the regex, using a standard algorithm." How did you come up with it?
Masi
Is there a harmonic series for the number of operations in the extreme situation?
Masi
O(2^m) is actually too pessimistic; you needn't create the DFA to run it :)
jpalecek
+1  A: 

To expand on theprise's answer: for the construction of the automaton, O(2^m) is the worst case, though it really depends on the form of the regular expression (for a very simple one that matches a literal word, it's O(m), using for example the Knuth-Morris-Pratt algorithm).
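For reference, a self-contained sketch of that O(m + n) special case (plain Python, no regex library):

    def kmp_search(pattern, text):
        """Knuth-Morris-Pratt: O(m) preprocessing, O(n) scan,
        never moving backwards in the text."""
        # failure[i] = length of the longest proper prefix of
        # pattern[:i+1] that is also a suffix of it
        failure = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k and pattern[i] != pattern[k]:
                k = failure[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            failure[i] = k
        # scan the text
        k = 0
        for i, ch in enumerate(text):
            while k and ch != pattern[k]:
                k = failure[k - 1]
            if ch == pattern[k]:
                k += 1
            if k == len(pattern):
                return i - k + 1   # index of the first occurrence
        return -1

    print(kmp_search("abab", "aabababc"))  # 1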

OysterD
A: 

You can trade space for speed by building a deterministic finite automaton instead of simulating an NFA. The DFA can be traversed in linear time, but in the worst case it needs O(2^m) space. I'd expect the tradeoff to be worth it.

Pramod
+1  A: 

Depends on what you mean by regex. If you allow only the operators of concatenation, alternation and Kleene star, the time can actually be O(m*n + m), where m is the size of the regex and n is the length of the string. You do it by constructing an NFA (that is linear in m), and then simulating it by maintaining the set of states you're in and updating that (in O(m)) for every letter of input.
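A minimal sketch of that set-of-states simulation in Python. The NFA here is written out by hand for the regex a(b|c)*; a real engine would build it from the pattern via Thompson's construction:

    # transitions[state][symbol] -> set of next states;
    # the key None marks epsilon (empty-string) moves
    transitions = {
        0: {'a': {1}},
        1: {None: {2, 4}},           # enter the (b|c)* loop, or skip it
        2: {'b': {3}, 'c': {3}},
        3: {None: {2, 4}},           # loop back, or exit
        4: {},                       # accepting state
    }
    ACCEPTING = {4}

    def epsilon_closure(states):
        """All states reachable from `states` via epsilon moves alone."""
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in transitions[s].get(None, ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def nfa_match(text):
        """O(m) work per input character, so O(m*n) overall."""
        current = epsilon_closure({0})
        for ch in text:
            step = set()
            for s in current:
                step |= transitions[s].get(ch, set())
            current = epsilon_closure(step)
            if not current:          # no live states left: fail early
                return False
        return bool(current & ACCEPTING)

    print(nfa_match("abcb"))  # True
    print(nfa_match("abd"))   # False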

Things that make regex parsing difficult:

  • parentheses and backreferences: capturing is still OK with the aforementioned algorithm, although it pushes the complexity higher, so it might be infeasible. Backreferences raise the recognition power of the regex, and matching with them is NP-complete (see the answer citing Aho below).
  • positive look-ahead: just another name for intersection, which raises the complexity of the aforementioned algorithm to O(m^2*n)
  • negative look-ahead: a disaster for constructing the automaton (O(2^m) states, possibly PSPACE-complete). But it should still be possible to tackle with the dynamic algorithm in something like O(n^2*m)

Note that with a concrete implementation, things might get better or worse. As a rule of thumb, simple features should be fast enough, and unambiguous regexes (e.g. not like a*a*) are better.
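A quick way to feel that rule of thumb in a concrete backtracking engine such as Python's re ((a+)+ is a close cousin of a*a*; timings are illustrative only):

    import re, time

    # On a string of a's with no final 'b', the ambiguous pattern forces
    # the backtracking engine to retry exponentially many ways of
    # splitting the a's; the unambiguous one fails immediately.
    for pat in (r"a+b", r"(a+)+b"):
        t0 = time.perf_counter()
        re.match(pat, "a" * 22)      # no 'b' present, so this must fail
        print(pat, f"{time.perf_counter() - t0:.3f}s")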

jpalecek
+2  A: 

Other theoretical info of possible interest.

For clarity, assume the standard definition of a regular expression

http://en.wikipedia.org/wiki/Regular_language

from formal language theory. Practically, this means that the only building materials are alphabet symbols, the operators of concatenation, alternation and Kleene closure, along with the unit and zero constants (which appear for algebraic reasons). Generally it's a good idea not to overload this term, despite the everyday practice in scripting languages, which leads to ambiguities.

There is an NFA construction (Thompson's) that solves the matching problem for a regular expression r and an input text t in O(|r| |t|) time and O(|r|) space, where |·| denotes the length function. This algorithm was further improved by Myers

http://doi.acm.org/10.1145/128749.128755

to the time and space complexity O(|r| |t| / log |t|) by using automaton node listings and the Four Russians paradigm. This paradigm seems to be named after four Russian guys who wrote a groundbreaking paper which is not online. However, the paradigm is illustrated in these computational biology lecture notes

http://lyle.smu.edu/~saad/courses/cse8354/lectures/lecture5.pdf

I find it hilarious to name a paradigm by the number and the nationality of authors instead of their last names.

The matching problem for regular expressions with added backreferences is NP-complete, which was proven by Aho

http://portal.acm.org/citation.cfm?id=114877

by a reduction from the vertex-cover problem which is a classical NP-complete problem.

To match regular expressions with backreferences deterministically we could employ backtracking (not unlike the Perl regex engine) to keep track of the possible subwords of the input text t that can be assigned to the variables in r. There are only O(|t|^2) subwords that can be assigned to any one variable in r. If there are n variables in r, then there are O(|t|^(2n)) possible assignments. Once an assignment of substrings to variables is fixed, the problem reduces to plain regular expression matching. Therefore the worst-case complexity for matching regular expressions with backreferences is O(|t|^(2n)).
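A toy illustration of that enumeration, for a hypothetical pattern (.+)a\1 with a single variable (a brute-force sketch, not how any production engine works):

    import re
    from itertools import combinations

    def match_one_backref(text):
        """Match the pattern (.+)a\\1 by trying every substring of
        `text` as the value of the variable, then checking the
        resulting plain, backreference-free pattern. There are
        O(|t|^2) candidate assignments for the one variable."""
        for i, j in combinations(range(len(text) + 1), 2):
            w = text[i:j]                       # candidate value for \1
            plain = re.escape(w) + "a" + re.escape(w)
            if re.fullmatch(plain, text):
                return w
        return None

    print(match_one_backref("xyaxy"))  # 'xy'
    print(match_one_backref("xyaxz"))  # None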

Note, however, that regular expressions with backreferences are still not full-featured regexen.

Take, for example, the "don't care" symbol, setting aside all other operators. There are several polynomial algorithms deciding whether a set of such patterns matches an input text. For example, Kucherov and Rusinowitch

http://dx.doi.org/10.1007/3-540-60044-2_46

define a pattern as a word w_1@w_2@...@w_n, where each w_i is a word (not a regular expression) and "@" is a variable-length "don't care" symbol not contained in any of the w_i. They derive an O((|t| + |P|) log |P|) algorithm for matching a set of patterns P against an input text t, where |t| is the length of the text and |P| is the total length of all the words in P.
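For a single such pattern, the obvious greedy baseline is easy to write down (this is the naive O(|t| |P|) check, not the cleverer algorithm from the paper):

    def matches_gap_pattern(words, text):
        """Check that `text` contains w_1 @ w_2 @ ... @ w_n, where '@'
        is a variable-length gap. Greedy leftmost placement is safe:
        taking the earliest occurrence of each word only leaves more
        room for the words that follow."""
        pos = 0
        for w in words:
            idx = text.find(w, pos)
            if idx == -1:
                return False
            pos = idx + len(w)
        return True

    print(matches_gap_pattern(["ab", "cd"], "xxabyycdzz"))  # True
    print(matches_gap_pattern(["ab", "cd"], "xxcdyyabzz"))  # False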

It would be interesting to know how these complexity measures combine, and what the complexity of the matching problem is for regular expressions with backreferences, "don't care" symbols and the other interesting features of practical regular expressions.

Alas, I haven't said a word about Python... :)

volodyako
A: 

If you're after matching and substitution, that implies grouping and backreferences.

Here is a Perl example where grouping and backreferences can be used to solve an NP-complete problem: http://perl.plover.com/NPC/NPC-3SAT.html

This (coupled with a few other theoretical tidbits) means that matching and substitution with regular expressions that allow backreferences is NP-complete.

Note that this is different from the formal definition of a regular expression, which doesn't have the notion of grouping, and which matches in polynomial time as described by the other answers.
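For a much smaller taste of that encoding power (a classic toy example, not the 3-SAT construction from the linked page):

    import re

    # A number n is composite iff a string of n '1's can be split into
    # two or more copies of some block of two or more '1's - which is
    # exactly what the backreference checks.
    composite = re.compile(r"^(11+)\1+$")

    for n in range(2, 13):
        print(n, "composite" if composite.match("1" * n) else "prime")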

Dave
I'm afraid I don't understand what you mean by "solving an NP-complete problem". Because of its NP-completeness, there is no known efficient (polynomial-time) algorithm for deciding instances of 3-CNF-SAT. One would need such an algorithm to *solve* the problem. The fact that one can encode 3-CNF-SAT with a regex does not yet mean that regexes solve 3-CNF-SAT, because there is no polynomial-time algorithm for matching such regexes.
volodyako
I didn't initially mention polynomial time. I thought my answer implied that regex matching and substitution is in NP, although on reflection it wasn't all that clear from how I wrote it. I've cleaned up my answer a little; hopefully it clarifies things. Thanks for the feedback.
Dave