views: 456

answers: 3

A question that I answered got me wondering:

How are regular expressions implemented in Python? What sort of efficiency guarantees are there? Is the implementation "standard", or is it subject to change?

I thought that regular expressions would be implemented as DFAs, and therefore were very efficient (requiring at most one scan of the input string). Laurence Gonsalves raised an interesting point that not all Python regular expressions are regular. (His example is r"(a+)b\1", which matches some number of a's, a b, and then the same number of a's as before). This clearly cannot be implemented with a DFA.
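
For concreteness, a quick check of that example with the standard re module (my own illustration):

```python
import re

# \1 must repeat exactly the text captured by (a+), a constraint no DFA
# can enforce for unbounded lengths.
print(bool(re.match(r"(a+)b\1$", "aaabaaa")))  # True  -- same number of a's
print(bool(re.match(r"(a+)b\1$", "aaabaa")))   # False -- counts differ
```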

So, to reiterate: what are the implementation details and guarantees of Python regular expressions?

It would also be nice if someone could give some sort of explanation (in light of the implementation) as to why the regular expressions "cat|catdog" and "catdog|cat" lead to different search results in the string "catdog", as mentioned in the question that I referenced before.
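
For reference, here is the difference I mean (a minimal demo with the standard re module):

```python
import re

# Alternation in Python's re is leftmost-first, not longest-match.
print(re.search("cat|catdog", "catdog").group())  # 'cat'
print(re.search("catdog|cat", "catdog").group())  # 'catdog'
```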

+7  A: 

Python's re module was originally based on PCRE, but has since moved to its own implementation.

Here is the link to the C code.

It appears that the library uses recursive backtracking: when an attempted path fails to match, the engine backs up and tries an alternative.
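
To make "recursive backtracking" concrete, here is a minimal toy matcher of my own (literals, '.' and '*' only); it is a sketch of the idea, not the actual _sre engine:

```python
def match_here(pattern, text):
    """Toy backtracking matcher: does `pattern` match a prefix of `text`?"""
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == "*":
        # Greedily consume as many matching characters as possible, then
        # backtrack one character at a time, retrying the rest of the pattern.
        i = 0
        while i < len(text) and pattern[0] in (text[i], "."):
            i += 1
        while i >= 0:
            if match_here(pattern[2:], text[i:]):
                return True
            i -= 1  # this step back is the "recursive backtracking"
        return False
    if text and pattern[0] in (text[0], "."):
        return match_here(pattern[1:], text[1:])
    return False

print(match_here("ca*t", "caaat"))  # True
print(match_here("ca*t", "ct"))     # True
print(match_here("ca*t", "cut"))    # False
```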

[Graph from the article linked below: matching time versus n for the regular expression a?^n a^n matching the text a^n]

Keep in mind that this graph is not representative of normal regex searches.

http://swtch.com/~rsc/regexp/regexp1.html
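
To get a rough sense of what that graph plots, here is the same pathological family run through re (my own timing sketch; absolute numbers depend on machine and Python version):

```python
import re
import time

# a?{n} a{n} matched against a{n}: every 'a?' can match 'a' or nothing,
# so a backtracking engine may explore on the order of 2**n paths.
def time_match(n):
    pattern = "a?" * n + "a" * n
    text = "a" * n
    start = time.perf_counter()
    re.match(pattern, text)
    return time.perf_counter() - start

for n in (5, 10, 15, 20):
    print(n, "%.4f s" % time_match(n))
# The times grow roughly exponentially in n; as noted above, ordinary
# patterns do not trigger this behaviour.
```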

As for matching "catdog" with "cat|catdog", here is how the NFA probably looks in Python: [NFA diagram]

The first path will probably go through cat, and then when there is no d at the end, it goes all the way back to the last branch.
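
A toy sketch of that ordering (my own illustration; match_alt and its arguments are invented, not CPython internals): each branch is tried left to right, and the engine only comes back to the alternation if the rest of the pattern cannot be matched after the chosen branch.

```python
def match_alt(branches, rest, text):
    """Toy: match one of `branches` followed by the literal `rest` at the start of `text`."""
    for branch in branches:                    # branches tried in written order
        if text.startswith(branch + rest):
            return branch + rest               # first branch whose continuation fits wins
        # otherwise backtrack to the alternation and try the next branch
    return None

print(match_alt(["cat", "catdog"], "", "catdog"))     # 'cat'    -- like cat|catdog
print(match_alt(["catdog", "cat"], "", "catdog"))     # 'catdog' -- like catdog|cat
print(match_alt(["catdog", "cat"], "dog", "catdog"))  # 'catdog' -- like (catdog|cat)dog,
                                                      # found only after backing out of
                                                      # the failed 'catdog' branch
```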

Unknown
+1: Use the source, Luke.
S.Lott
(I realize this comment is late) I like your explanation, except I don't think the last part about matching "cat|catdog" is correct. Using "cat|catdog" produces "cat" as a result and "catdog|cat" produces "catdog" as a result; basically, the order matters. There are two things going on. First, `findall` only finds non-overlapping matches, so you shouldn't expect "cat" AND "catdog". Second, if I were implementing this, I think it's easy to tell the NFA can be converted to a DFA, and so you would have "c -> a -> *t* -> d -> o -> *g*" where asterisks denote a final state.
Tom
(continued...): So basically, the "t" is a final state, and I feel like the search should always just return "cat" because that's as far as it needs to go to find a match. Nevertheless, your answer was helpful, and I am going to accept it (months later :-).
Tom
+4  A: 

There are no "efficiency guarantees" on Python REs any more than on any other part of the language (C++'s standard library is the only widespread language standard I know that tries to establish such standards -- but there are no standards, even in C++, specifying that, say, multiplying two ints must take constant time, or anything like that); nor is there any guarantee that big optimizations won't be applied at any time.

Today, F. Lundh (the original implementer of Python's current RE module), presenting Unladen Swallow at PyCon Italia, mentioned that one of the avenues they'll be exploring is to compile regular expressions directly to LLVM intermediate code (rather than to their own bytecode flavor interpreted by an ad-hoc runtime). Since ordinary Python code is also being compiled to LLVM (in a soon-forthcoming release of Unladen Swallow), a RE and its surrounding Python code could then be optimized together, sometimes in quite aggressive ways. I doubt anything like that will be anywhere close to "production-ready" very soon, though ;-).

Alex Martelli
A: 

Matching regular expressions with backreferences is NP-hard, i.e. at least as hard as the NP-complete problems. That basically means it's as hard as any problem you're likely to encounter, and most computer scientists think it could require exponential time in the worst case. If you could match such "regular" expressions (which really aren't regular, in the technical sense) in polynomial time, you could win a million bucks (the P vs. NP Millennium Prize).
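
As a side illustration of how far backreferences reach beyond the regular languages (my own addition, using the well-known unary primality-test pattern; it is not part of the NP-hardness argument):

```python
import re

def is_prime(n):
    # '1' * n matches (11+?)\1+ exactly when n can be split into two or
    # more equal blocks of length >= 2, i.e. when n is composite.
    return n > 1 and not re.match(r"^(11+?)\1+$", "1" * n)

print([n for n in range(2, 20) if is_prime(n)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```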

Doctor J