There are some features in modern regex engines that allow you to match languages that couldn't be matched without that feature. For example, the following regex using back references matches the language of all strings that consist of a word that repeats itself: (.+)\1. This language is not regular and can't be matched by a regex that does not use back references.

Does lookaround also affect which languages can be matched by a regular expression? I.e. are there any languages that can be matched using lookaround, which couldn't be matched otherwise? If so, is this true for all flavors of lookaround (negative or positive lookahead or lookbehind) or just for some of them?

+17  A: 

The answer to the question you ask, which is whether a larger class of languages than the regular languages can be recognised with regular expressions augmented by lookaround, is no.

A proof is relatively straightforward, but an algorithm to translate a regular expression containing lookarounds into one without is messy.

First: note that you can always negate a regular expression (over a finite alphabet). Given a finite state automaton that recognises the language generated by the expression, you can simply exchange all the accepting states for non-accepting states to get an FSA that recognises exactly the negation of that language, for which there are a family of equivalent regular expressions.

Second: because regular languages (and hence regular expressions) are closed under negation they are also closed under intersection since A intersect B = neg ( neg(A) union neg(B)) by de Morgan's laws. In other words given two regular expressions, you can find another regular expression that matches both.

This allows you to simulate lookaround expressions. For example u(?=v)w matches only strings that will match both uv and uw.

For negative lookahead you need the regular expression equivalent of the set theoretic A\B, which is just A intersect (neg B) or equivalently neg (neg(A) union B). Thus for any regular expressions r and s you can find a regular expression r-s which matches those strings that match r but do not match s. In negative lookahead terms: u(?!v)w matches only those strings which match uw - uv.
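The closure steps above (negation by swapping accepting states, then intersection and difference via the product construction) can be checked concretely with toy DFAs. This is a sketch of my own: the two example automata, the dict encoding, and helper names like `run_product` are not from any library.

```python
# Two complete DFAs over the alphabet {a, b}, encoded as dicts.
# ENDS_IN_B accepts strings ending in 'b'; CONTAINS_A accepts strings containing an 'a'.
ENDS_IN_B = {'start': 0, 'accept': {1}, 'states': {0, 1},
             'delta': {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}}
CONTAINS_A = {'start': 0, 'accept': {1}, 'states': {0, 1},
              'delta': {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 1, (1, 'b'): 1}}

def run(dfa, s):
    q = dfa['start']
    for ch in s:
        q = dfa['delta'][q, ch]
    return q in dfa['accept']

def complement(dfa):
    # negation: swap accepting and non-accepting states (the DFA must be complete)
    return dict(dfa, accept=dfa['states'] - dfa['accept'])

def run_product(d1, d2, s):
    # product construction: the pair (q1, q2) is the product state;
    # accepting both components gives the intersection
    q1, q2 = d1['start'], d2['start']
    for ch in s:
        q1, q2 = d1['delta'][q1, ch], d2['delta'][q2, ch]
    return q1 in d1['accept'] and q2 in d2['accept']

# intersection of the two languages, and the difference A\B = A intersect neg(B)
assert run_product(ENDS_IN_B, CONTAINS_A, 'ab')                 # ends in b AND has an a
assert not run_product(ENDS_IN_B, CONTAINS_A, 'bb')
assert run_product(ENDS_IN_B, complement(CONTAINS_A), 'bb')     # ends in b AND no a
assert not run_product(ENDS_IN_B, complement(CONTAINS_A), 'ab')
```

Note that the complement step only works as a plain state swap because the transition function is total; a partial NFA would first have to be determinized and completed.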

There are several reasons why lookaround is useful.

First, because the negation of a regular expression can result in something much less tidy. For example q(?!u)=q($|[^u]).
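A quick Python spot-check of that q(?!u) = q($|[^u]) equivalence; the word list is just an ad-hoc sample I picked:

```python
import re

# q(?!u) and q($|[^u]) should agree on whether a string contains a
# q-not-followed-by-u, and on where that match starts.
for w in ['Iraq', 'Iraq!', 'quit', 'qatar', 'qu', 'q', 'xyz']:
    m1 = re.search(r'q(?!u)', w)
    m2 = re.search(r'q($|[^u])', w)
    assert (m1 is None) == (m2 is None), w
    if m1:
        assert m1.start() == m2.start(), w
```

Where the two differ is exactly the point made next: the lookaround version consumes one character fewer.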

Second, regular expressions do more than recognize languages; they also consume characters from a string - or at least that's how we like to think about them. For example, in Python I care about the .start() and .end(), thus of course:

>>> re.search('q($|[^u])', 'Iraq!').end()
5
>>> re.search('q(?!u)', 'Iraq!').end()
4

Third, and I think this is a pretty important reason, negation of regular expressions does not lift nicely over concatenation. neg(a)neg(b) is not the same thing as neg(ab), which means that you cannot translate a lookaround out of the context in which you find it - you have to process the whole string. I guess that makes it unpleasant for people to work with and breaks people's intuitions about regular expressions.

I hope I have answered your theoretical question (it's late at night, so forgive me if I am unclear). I agree with a commenter who said that this does have practical applications. I met very much the same problem when trying to scrape some very complicated web pages.

EDIT

My apologies for not being clearer: I do not believe you can give a proof of regularity of regular expressions + lookarounds by structural induction, my u(?!v)w example was meant to be just that, an example, and an easy one at that. The reason a structural induction won't work is because lookarounds behave in a non-compositional way - the point I was trying to make about negations above. I suspect any direct formal proof is going to have lots of messy details. I have tried to think of an easy way to show it but cannot come up with one off the top of my head.

To illustrate using Josh's first example, ^([^a]|a(?=..b))*$ is equivalent to a 7-state DFSA with all states accepting:

A - (a) -> B - (a) -> C --- (a) --------> D 
Λ          |           \                  |
|          (not a)       \               (b)
|          |              \               | 
|          v                \             v
(b)        E - (a) -> F      \-(not a)--> G  
|            <- (b) - /                   |
|          |                              |
|         (not a)                         |
|          |                              |
|          v                              |
\--------- H <-------------------(b)-----/

The regular expression for state A alone looks like:

^(a([^a](ab)*[^a]|a(ab|[^a])*b)b)*$

In other words any regular expression you are going to get by eliminating lookarounds will in general be much longer and much messier.
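For what it's worth, the claimed behavior of the lookaround version can be brute-force checked against the plain-English property (every 'a' must be followed three characters later by a 'b'). A sketch of my own, restricted to the alphabet {a, b}; `property_holds` is a hypothetical helper name:

```python
import re
from itertools import product

LOOK = re.compile(r'([^a]|a(?=..b))*')

def property_holds(s):
    # every 'a' at index i must have a 'b' at index i + 3
    return all(i + 3 < len(s) and s[i + 3] == 'b'
               for i, ch in enumerate(s) if ch == 'a')

# exhaustive check over all strings of length <= 6
for n in range(7):
    for s in map(''.join, product('ab', repeat=n)):
        assert bool(LOOK.fullmatch(s)) == property_holds(s), s
```

Each iteration of the star consumes exactly one character (either a non-a, or an 'a' whose lookahead is checked without consuming), which is why the exhaustive check lines up with the property.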

To respond to Josh's comment - yes I do think the most direct way to prove the equivalence is via the FSA. What makes this messier is that the usual way to construct an FSA is via a non-deterministic machine - it's much easier to express u|v as simply the machine constructed from machines for u and v with an epsilon transition to the two of them. Of course this is equivalent to a deterministic machine, but at the risk of exponential blow-up of states. Whereas negation is much easier to do via a deterministic machine.

The general proof will involve taking the cartesian product of two machines and selecting those states you wish to retain at each point you want to insert a lookaround. The example above illustrates what I mean to some extent.

My apologies for not supplying a construction.

FURTHER EDIT: I have found a blog post which describes an algorithm for generating a DFA out of a regular expression augmented with lookarounds. It's neat because the author extends the idea of an NFA-e with "tagged epsilon transitions" in the obvious way, and then explains how to convert such an automaton into a DFA.

I thought something like that would be a way to do it, but I'm pleased that someone has written it up. It was beyond me to come up with something so neat.

Francis Davey
-1 You can't build an RE to match even his simple example of words that repeat themselves... I think your proof goes awry in the paragraph about negative lookahead. You're still talking about simple RE closure properties without any bearing on lookaround.
Edmund
+1. Looks right.
Moron
I agree with Francis that lookaround is regular, but I think the proof is incorrect. The problem is that you can't break a regular expression with lookaround into two regular expressions A and B in general. Francis did so by transforming `u(?!v)w` into `uw` and `uv`, but I don't believe an algorithm exists to do this in general. Instead you can attach lookahead or neg(lookahead) to the original DFA at the point where it occurs with an epsilon transition. The details of this are a bit tricky, but I think it works.
Josh Haberman
Vote reversed, I stand corrected. :-)
Edmund
For example, consider the regex `^([^a]|a(?=..b))*$`. In other words, all characters are allowed, but every "a" must be followed three characters later with a "b". I don't believe you can reduce this to two regular expressions A and B that you are combining via union. I think you have to make the positive lookahead part of the NFA construction.
Josh Haberman
@Josh: Given language corresponding to uw say L1, to uv say L2 and u(?!v)w say L3, then L3 = L1 - L2. If L1 and L2 are regular, then so is L3. Why does one need any algorithm of any kind? Note, it is the set difference here, not union. In the other case, it is set _intersection_. Not sure where you got union from.
Moron
@Moron: clearly it works for the simple case of `u(?!v)w`, but to prove the proposition you have to show that it works for an *arbitrary* regex. The question is: how do you construct L1 and L2 from L3 given an *arbitrary* L3? I claim it cannot be done in general. For example, what are L1 and L2 for the regex I give above? My bad on the intersection vs. union.
Josh Haberman
u, v, w are _arbitrary_ regex. If you want a _formal_ proof, use induction on the number of lookaheads. Please don't ask for equivalent regular expressions, as the whole point of using lookaround is to not complicate matters. The expression corresponding to L1 - L2 or L1 intersection L2 can get pretty hairy. Speaking in terms of regular languages and their intersection/set difference is a perfectly valid proof. Why does one _have_ to construct anything?
Moron
@Moron: @Josh's point is that not every regex using lookaround has the form `u(?!v)w` or `u(?=v)w`. The lookaround can be part of an alternation or repetition. For his example you can't just say u is `^([^a]|a`, v is `..b` and w is `)*$` because then u and w aren't valid regexes.
sepp2k
You have to construct L1 and L2, because you have not otherwise actually proved that L1 and L2 exist. I do not believe they exist for the expression I showed above. How are you going to convince me that they do? My example has only a single lookahead, so if you really believe that your formulation covers the base case of a single lookahead and that you can prove for an arbitrary number of lookaheads via induction, you should be able to trivially translate my regex into L1 and L2. Otherwise you have not proved anything, and an induction-based proof would not prove anything either.
Josh Haberman
@Josh, sepp2k: For every regular language L, there is an equivalent regular expression and vice versa. Now a(?=..b) is regular, it corresponds to some expression say r. Now you have ([^a]|r)*, which is again regular. I believe this was proven by Kleene. Check this out: http://www.coli.uni-saarland.de/projects/milca/courses/coal/html/node44.html. The proof by induction does work. What you seem to be hung up upon is a fundamental fact about regular expressions and languages (my first sentence in this comment).
Moron
@Moron: you are presuming that lookahead expressions compose the same way that regular expressions do. You are presuming that `([^a]|r)*` matches the same language as `([^a]|a(?=..b))`, which *is not true*, even if `r` matches the same language as `a(?=..b)`. If you do the DFA expansion yourself you will see. Since lookahead matches characters without consuming them, it doesn't compose in the same way that regular expressions do. If you are still unconvinced of this I will post an actual DFA expansion later.
Josh Haberman
As a short proof of this, consider that `a(?=..b)` is the empty language, because `a ∩ a..b = ∅`. So if we follow your line of reasoning `r = ∅` and `([^a]|a(?=..b))*` is equivalent to `([^a]|∅)*` or just `[^a]*`. But this is clearly false because `aaab` matches the original regex but not the supposedly equivalent one.
Josh Haberman
@Josh: I see. You are right! Sorry for the noise. I should have thought about it more carefully. Thanks for clarifying it.
Moron
@Josh: As to your comment about DFA, non-deterministic finite automata (NFA) and deterministic finite automata (DFA) are equivalent, so I believe using an NFA to show the equivalence will probably be easier than a DFA. In fact, there seems to be a name for what you were talking about: NFA-e, an NFA with epsilon transitions. From: http://en.wikipedia.org/wiki/Nondeterministic_finite_state_machine
Moron
@Moron: NFA-e's are very easy and natural to write and easy to generate from regular expressions, but negating them is less easy since you can't just swap accepting and non-accepting states since an NFA succeeds if *any* path reaches an accepting state.
Francis Davey
@Francis: Yes, you do that after you convert to a DFA. I wasn't saying work _only_ with NFAs :-)
Moron
@Francis: I am not convinced by that blog post. Care to explain a bit more here? Tagging transitions seems to translate to condition evaluation during interpretation, which DFAs don't have (apart from simple input check to determine next state).
Moron
That DFA confuses me. How do you get out of state H? If it is epsilon transition, that DFA is incorrect (for instance, it accepts `aaabba`)
BlueRaja - Danny Pflughoeft
@BlueRaja: oops, my mistake, it's a (b).
Francis Davey
@Moron - the tagged transitions are on arcs of NFAs, not DFAs. They act as lookahead assertions on those arcs when running the NFA (you may only go along the arc if the assertion matches). When converting to a DFA, you rather keep track of (and differentiate between) states depending on what conditions were/needed to be met on reaching them. Without some graphics (I'm not at all convinced by my existing diagram) or a lot more space I'm not sure how to demonstrate this in action.
Francis Davey
@Francis: Even NFAs don't have conditions on arcs, except for the simple input tape symbol check. Even if we convert those arcs to DFA/NFAs, the problem becomes that we have to backtrack on what we consumed. So I am not sure if that blog proves that it is regular.
Moron
@Josh: btw, did you mean abbb for ([^a]| a(?=..b))* ? aaab does not look right. Can you please clarify?
Moron
@Moron: yes, you are correct, sorry about that.
Josh Haberman
By the way, here's my construction. During NFA construction, to concatenate a regex with a negative lookahead assertion `(?!r)`, perform the concatenation by adding an epsilon transition to an NFA for `r`. However r's final state(s) are in effect "anti-accept" states, meaning that the match fails if the NFA ends in that state, even if it also ends in an accept state. You could apply this rule during NFA->DFA minimization to make it into a run-of-the-mill DFA. You can do the same with positive lookahead by negating first, ending up with something like: http://imgur.com/cJSSO.png
Josh Haberman
Sorry, try http://imgur.com/XR2HO.png instead, the begin state should be an accept state.
Josh Haberman
@Josh. I don't think your construction works for the same reasons Francis' original proof didn't. I think you are saying the same thing which I had mistakenly said earlier. In order to determine that !r is matched, you have to consume input, and when the whole thing is followed by other expressions and/or included in a big Kleene closure, you get wrong results because you have to undo the consumption of input done by the NFA for `r`, and that is not possible in the simple version of NFA. A 2-way NFA (or 2-NFA) is suitable for that (please see my answer).
Moron
@Moron: I am pretty sure my construction does work. My construction is significantly different than Francis's. Mine does *not* have to consume input to determine that !r is matched, because an NFA can be in more than one state at a time. Look at http://i.imgur.com/XR2HO.png: an "a" transitions the NFA into *both* the begin state *and* an anti-accept state. Another `a` will put the NFA in multiple anti-accept states at the same time; any one of them will cause the match to fail. Only when the NFA ends in 0 anti-accept states and >=1 accept states does the match succeed.
Josh Haberman
@Josh: sounds right to me. I think where your approach ends up being similar to the one I had in mind - and to the augmentation of NFAs with special forms of epsilon transition that I linked to - is in transforming the NFA with anti-accept states to a DFA (and hence showing that the languages accepted are the same). It's intuitively obvious you can do this, but rather than each resulting state being labelled by a set of NFA states there will be some more labelling to do.
Francis Davey
@Moron, @Francis: I thought of a clearer way to explain my construction. Given `u(?=v)w` or `u(?!v)w`, I can construct this NFA: http://i.imgur.com/nmCMq.png where `f(v)` is the DFA for `v` modified to have anti-accept states as I described earlier. You can then continue the NFA construction with whatever surrounds this expression. This works because it runs `w` and `f(v)` in parallel. I like this construction because I think it would lead to the simplest implementation in practice, whether you ultimately convert the NFA to a DFA or not.
Josh Haberman
@Josh: NFA can be in multiple states, but those are on _different_ runs of the machine! On a single run of the machine, it will only be in one state at one time. You cannot combine multiple runs and say input is not consumed. If it accepts on _any_ of the runs, then it accepts the string. For instance in your case it looks like your machine, for, a(?=..b), will accept `a`. Also, don't you _have_ to consume input to determine if a regex has matched, whether deterministic or not. How can you do without it?
Moron
@Josh/Francis: An NFA is not a parallel computation machine, where you have 'multiple processors' going off and concurrently dealing with the input! Even non-deterministic Turing machines don't have that. You can think of it as copies of the machine working in parallel, each with its own copy of the input tape.
Moron
@Moron: that's fair, I usually think of NFAs as parallel but your point is taken. My construction still works if you perform the traditional "set of subsets" NFA->DFA algorithm to my not-really-NFA and apply my rule for deciding whether each DFA state is an accept state or not. This gives you a traditional DFA that correctly recognizes the regex-with-lookaround, which proves that it is regular.
Josh Haberman
@Moron, @Francis: I expanded my explanation into my own answer below, see what you think.
Josh Haberman
@Josh: That is what I was saying before, till you pointed out a mistake in that! Point being, once you go through the lookahead NFA, you have consumed the input without any way of getting it back. With your construction, it is not a lookahead anymore, but actually a 'normal' regex, i.e. the construction gives (a..b | a) instead of a(?=..b) (at least if I understood you correctly). Your construction seems to be no different than an NFA for ..b. The e-transition only makes it same as a..b or ae. I don't think it works, but I have been wrong before. (this is referring to your comment, not answer)
Moron
A: 

I have a feeling that there are two distinct questions being asked here:

  • Are Regex engines that incorporate "lookaround" more powerful than Regex engines that don't?
  • Does "lookaround" empower a Regex engine with the ability to parse languages that are more complex than those generated from a Chomsky Type 3 - Regular grammar?

The answer to the first question in a practical sense is yes. Lookaround will give a Regex engine that uses this feature fundamentally more power than one that doesn't. This is because it provides a richer set of "anchors" for the matching process. Lookaround lets you define an entire Regex as a possible anchor point (zero width assertion). You can get a pretty good overview of the power of this feature here.
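A small Python illustration of lookaround as a zero-width anchor. The sample text is made up; note the extra `(?!\d)` guard, needed to stop `\d+` from backtracking one digit and sneaking past the negative lookahead:

```python
import re

# Lookaround as a zero-width "anchor": match in a context without consuming it.
text = 'price: 42 USD, tip: 7 USD, tax: 5 EUR'

# amounts followed by ' USD' (the lookahead leaves ' USD' unconsumed)
assert re.findall(r'\d+(?= USD)', text) == ['42', '7']

# amounts NOT followed by ' USD'; without (?!\d), the engine would
# backtrack '42' down to '4' so that (?! USD) succeeds spuriously
assert re.findall(r'\d+(?! USD)(?!\d)', text) == ['5']
```

The second pattern shows both the power and the subtlety: the assertion anchors the match to a context, but the surrounding quantifier will happily backtrack around it unless pinned down.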

Lookaround, although powerful, does not lift the Regex engine beyond the theoretical limits placed on it by a Type 3 Grammar. For example, you will never be able to reliably parse a language based on a Context Free - Type 2 Grammar using a Regex engine equipped with lookaround. Regex engines are limited to the power of a Finite State Automaton and this fundamentally restricts the expressiveness of any language they can parse to the level of a Type 3 Grammar. No matter how many "tricks" are added to your Regex engine, languages generated via a Context Free Grammar will always remain beyond its capabilities. Parsing a Context Free - Type 2 grammar requires a pushdown automaton to "remember" where it is in a recursive language construct. Anything that requires a recursive evaluation of the grammar rules cannot be parsed using Regex engines.
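To make the recursion point concrete, here is the kind of pushdown memory a finite automaton lacks. A sketch for a single bracket pair of my own devising; for one kind of bracket a counter is enough to play the role of the stack:

```python
def balanced(s):
    # The counter stands in for a pushdown stack with one symbol.
    # No DFA has enough states to track unbounded nesting depth.
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:      # a closer with no matching opener
                return False
    return depth == 0

assert balanced('(()(()))')
assert not balanced('(()')
assert not balanced(')(')
```

Any fixed regex (lookaround included) can only count nesting up to some bound; a string nested one level deeper defeats it.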

To summarize: Lookaround provides some practical benefits to Regex engines but does not "alter the game" on a theoretical level.

EDIT

Is there some grammar with a complexity somewhere between Type 3 (Regular) and Type 2 (Context Free)?

I believe the answer is no. The reason is because there is no theoretical limit placed on the size of the NFA/DFA needed to describe a Regular language. It may become arbitrarily large and therefore impractical to use (or specify). This is where dodges such as "lookaround" are useful. They provide a short-hand mechanism to specify what would otherwise lead to very large/complex NFA/DFA specifications. They do not increase the expressiveness of Regular languages, they just make specifying them more practical. Once you get this point, it becomes clear that there are a lot of "features" that could be added to Regex engines to make them more useful in a practical sense - but nothing will make them capable of going beyond the limits of a Regular language.

The basic difference between a Regular and a Context Free language is that a Regular language does not contain recursive elements. In order to evaluate a recursive language you need a Pushdown Automaton to "remember" where you are in the recursion. An NFA/DFA does not stack state information, so it cannot handle the recursion. So given a non-recursive language definition there will be some NFA/DFA (but not necessarily a practical Regex expression) to describe it.

NealB
Is it necessarily true that a grammar more powerful than regular grammar must be as powerful as context-free? ie. is it known that there is *no* grammar "between" the two?
BlueRaja - Danny Pflughoeft
@BlueRaja: Exactly what I was thinking: the 'grammar continuum hypothesis' :-)
Moron
@Moron @BlueRaja - I have edited my answer for you. Hope it helps.
NealB
Of course there are many classes of grammars strictly between the class of regular grammars and the class of context-free grammars, including trivial examples such as the class of regular grammars together with the grammars for the language of balanced brackets. The [deterministic context free grammars](http://en.wikipedia.org/wiki/Deterministic_context-free_grammar) are a more useful example.
Christian Semrau
+3  A: 

As the other answers claim, lookarounds don't add any extra power to regular expressions.

I think we can show this using the following:

One Pebble 2-NFA (see the Introduction section of the paper which refers to it).

The 1-pebble 2-NFA does not deal with nested lookaheads, but we can use a variant of multi-pebble 2-NFAs (see section below).

Introduction

A 2-NFA is a non-deterministic finite automaton which has the ability to move either left or right on its input.

A one pebble machine is where the machine can place a pebble on the input tape (i.e. mark a specific input symbol with a pebble) and do possibly different transitions based on whether there is a pebble at the current input position or not.

It is known that the one-pebble 2-NFA has the same power as a regular DFA.

Non-nested Lookaheads

The basic idea is as follows:

The 2-NFA allows us to backtrack (or 'front track') by moving forward or backward in the input tape. So for a lookahead we can do the match for the lookahead regular expression and then backtrack over what we consumed in matching the lookahead expression. In order to know exactly when to stop backtracking, we use the pebble! We drop the pebble before we enter the DFA for the lookahead to mark the spot where the backtracking needs to stop.

Thus at the end of running our string through the pebble 2-NFA, we know whether we matched the lookahead expression or not, and the remaining input (i.e. what is left to be consumed) is exactly what is required to match the rest.

So for a lookahead of the form u(?=v)w

We have the DFAs for u, v and w.

From the accepting state (yes, we can assume there is only one) of DFA for u, we make an e-transition to the start state of v, marking the input with a pebble.

From an accepting state for v, we e-transition to a state which keeps moving left on the input until it finds the pebble, and then transitions to the start state of w.

From a rejecting state of v, we e-transition to a state which keeps moving left until it finds the pebble, and transitions to the accepting state of u (i.e. where we left off).

The proofs used for regular NFAs to show r1|r2, r*, etc. carry over to these one-pebble 2-NFAs. See http://www.coli.uni-saarland.de/projects/milca/courses/coal/html/node41.html#regularlanguages.sec.regexptofsa for more info on how the component machines are put together to give the bigger machine for the r* expression etc.

The reason why the above proofs for r* etc work is that the backtracking ensures that the input pointer is always at the right spot, when we enter the component nfas for repetition. Also, if a pebble is in use, then it is being processed by one of the lookahead component machines. Since there are no transitions from lookahead machine to lookahead machine without completely backtracking and getting back the pebble, a one pebble machine is all that is needed.

For example, consider ([^a] | a(?=..b))*

and the string abbb.

We have abbb which goes through the peb2nfa for a(?=..b), at the end of which we are at the state: (bbb, matched) (i.e. the input bbb is remaining, and we have matched 'a' followed by '..b'). Now because of the *, we go back to the beginning (see the construction in the link above), and enter the DFA for [^a]. Match b, go back to the beginning, enter [^a] again two times, and then accept.
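The pebble-and-rewind mechanism can be mimicked in Python, with string slicing standing in for tape movement. This is a sketch only (`match_lookahead` is my own name, and it assumes u matches a unique prefix, ignoring the general case where u can match in several ways):

```python
import re

def match_lookahead(s, u, v, w):
    """Match s against u(?=v)w: consume u, drop a 'pebble' at the
    current position, run the machine for v forward, rewind to the
    pebble, then continue with w to the end of the input."""
    m = re.match(u, s)                 # consume u
    if not m:
        return False
    pebble = m.end()                   # drop the pebble
    if not re.match(v, s[pebble:]):    # the lookahead machine runs forward...
        return False
    return re.fullmatch(w, s[pebble:]) is not None   # ...then we rewind

# agrees with the engine's own lookahead on the worked example
assert match_lookahead('abbb', 'a', '..b', 'b*') == \
       bool(re.fullmatch('a(?=..b)b*', 'abbb'))
assert match_lookahead('abb', 'a', '..b', 'b*') == \
       bool(re.fullmatch('a(?=..b)b*', 'abb'))
```

The `pebble` variable is doing exactly the job the pebble does on the tape: marking where the lookahead started so the main match can resume from there.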

Dealing with Nested Lookaheads

To handle nested lookaheads we can use a restricted version of k-pebble 2NFA as defined here: Complexity Results for Two-Way and Multi-Pebble Automata and their Logics (see Definition 4.1 and Theorem 4.2).

In general, 2 pebble automata can accept non-regular sets, but with the following restrictions, k-pebble automata can be shown to be regular (Theorem 4.2 in above paper).

If the pebbles are P_1, P_2, ..., P_K

  • P_{i+1} may not be placed unless P_i is already on the tape and P_{i} may not be picked up unless P_{i+1} is not on the tape. Basically the pebbles need to be used in a LIFO fashion.

  • Between the time P_{i+1} is placed and the time that either P_{i} is picked up or P_{i+2} is placed, the automaton can traverse only the subword located between the current location of P_{i} and the end of the input word that lies in the direction of P_{i+1}. Moreover, in this sub-word, the automaton can act only as a 1-pebble automaton with Pebble P_{i+1}. In particular it is not allowed to lift up, place or even sense the presence of another pebble.

So if v is a nested lookahead expression of depth k, then (?=v) is a nested lookahead expression of depth k+1. When we enter a lookahead machine within, we know exactly how many pebbles have to have been placed so far and so can exactly determine which pebble to place and when we exit that machine, we know which pebble to lift. All machines at depth t are entered by placing pebble t and exited (i.e. we return to processing of a depth t-1 machine) by removing pebble t. Any run of the complete machine looks like a recursive dfs call of a tree and the above two restrictions of the multi-pebble machine can be catered to.

Now when you combine expressions, for rr1, since you concat, the pebble numbers of r1 must be incremented by the depth of r. For r* and r|r1 the pebble numbering remains the same.

Thus any expression with lookaheads can be converted to an equivalent multi-pebble machine with the above restrictions in pebble placement and so is regular.
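As a quick sanity check that nesting behaves as claimed in practice, here is a depth-2 lookahead (an example of my own) collapsing to a single flattened one under Python's re:

```python
import re

# a(?=b(?=c)) -- a lookahead nested inside a lookahead -- accepts the
# same strings as the flattened a(?=bc)
for s in ['abc', 'abd', 'ab', 'a', 'bc']:
    assert bool(re.match(r'a(?=b(?=c))', s)) == bool(re.match(r'a(?=bc)', s)), s
```

This does not prove anything about the general multi-pebble argument, but it illustrates the LIFO discipline: the inner assertion is entered and exited entirely within the outer one.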

Conclusion

This basically addresses the drawback in Francis's original proof: the construction prevents the lookahead expressions from consuming anything that is required for future matches.

Since lookbehinds are just finite strings (not really regexes) we can deal with them first, and then deal with the lookaheads.

Sorry for the incomplete writeup, but a complete proof would involve drawing a lot of figures.

It looks right to me, but I will be glad to know of any mistakes (which I seem to be fond of :-)).

Moron
I'm not sure this handles multiple lookaheads, eg `u(?=v)(?=w)(?=x)z` does it?
Francis Davey
When we exit the pebble 2NFA for a lookahead we are back at the input tape state where we entered, with a pebble to use, and depending on a match or not for the lookahead, we are in one of two different states (i.e we can tell if there was a match). So it seems like it will work by just concatenating the automata (with the extra states with e-transitions we added) as we always get back the pebble. But I guess it depends on how you interpret that expression. Is it same as u(?=vwx)z? or ((u(?=v))?=w)... etc?
Moron
The expression matches a u which must be followed by (non consuming) all three of v, w and x (where v, w and x are all general regular expressions) and a z. Having tried to build something that will solve this, I am fairly convinced that you cannot do it compositionally (i.e. by concatenating solutions).
Francis Davey
@Francis: If it has to match all of them, then concatenation works (I think). we concat it as dfa(u) -> peb2ndfa(v) -> peb2ndfa(w) -> dfa(x). If after matching u, we don't match v or w, we go back to u and pick up where we left off. If we match v, then because we backtrack after v is done, we can match w again (which again backtracks) and then match x. The key is that the 2NDFA allows us to back track and the pebble allows to know when to stop backtracking.
Moron
@sepp2k: Did you get a chance to read this answer? If you have any questions/clarification/counterexamples, I would be glad to answer.
Moron
Can this handle nested lookaheads? I think you'd need as many pebbles as the nesting is deep.
sepp2k
@sepp2k: No it does not handle nesting, but there are variants of multi-pebble automata which are regular. I will edit the answer to add that info.
Moron
@sepp2k: Edited to add that info. I believe we can use the variant of multi-pebble automata described in the paper I linked to, and that paper shows that it is regular.
Moron
@sepp2k: I have edited the answer to make it a little bit clearer on how to use the multi-pebble machine with constraints. Let me know if you have questions.
Moron
+5  A: 

I agree with the other posts that lookaround is regular (meaning that it does not add any fundamental capability to regular expressions), but I have an argument for it that is simpler IMO than the other ones I have seen.

I will show that lookaround is regular by providing a DFA construction. A language is regular if and only if it has a DFA that recognizes it. Note that Perl doesn't actually use DFAs internally (see this paper for details: http://swtch.com/~rsc/regexp/regexp1.html) but we construct a DFA for purposes of the proof.

The traditional way of constructing a DFA for a regular expression is to first build an NFA using Thompson's Algorithm. Given two regular expression fragments r1 and r2, Thompson's Algorithm provides constructions for concatenation (r1r2), alternation (r1|r2), and repetition (r1*) of regular expressions. This allows you to build an NFA bit by bit that recognizes the original regular expression. See the paper above for more details.

To show that positive and negative lookahead are regular, I will provide a construction for concatenation of a regular expression u with positive or negative lookahead: (?=v) or (?!v). Only concatenation requires special treatment; the usual alternation and repetition constructions work fine.

The construction for both u(?=v) and u(?!v) is:

http://imgur.com/ClQpz.png

In other words, connect every final state of the existing NFA for u to both an accept state and to an NFA for v, but modified as follows. The function f(v) is defined as:

  • Let aa(v) be a function on an NFA v that changes every accept state into an "anti-accept state". An anti-accept state is defined to be a state that causes the match to fail if any path through the NFA ends in this state for a given string s, even if a different path through v for s ends in an accept state.
  • Let loop(v) be a function on an NFA v that adds a self-transition on any accept state. In other words, once a path leads to an accept state, that path can stay in the accept state forever no matter what input follows.
  • For negative lookahead, f(v) = aa(loop(v)).
  • For positive lookahead, f(v) = aa(neg(v)).

To provide an intuitive example for why this works, I will use the regex (b|a(?=.b))+, which is a slightly simplified version of the regex I proposed in the comments of Francis's proof. If we use my construction along with the traditional Thompson constructions, we end up with:

[NFA diagram for (b|a(?=.b))+ built from the construction above]

The e's are epsilon transitions (transitions that can be taken without consuming any input), and the anti-accept states are labeled with an X. In the left half of the graph you see the representation of (a|b)+: any a or b puts the graph in an accept state, but also allows a transition back to the begin state so we can do it again. But note that every time we match an a we also enter the right half of the graph, where we are in anti-accept states until we match "any" followed by a b.

This is not a traditional NFA, because traditional NFAs don't have anti-accept states. However, we can use the traditional NFA->DFA algorithm to convert it into a traditional DFA. The algorithm works as usual: we simulate multiple runs of the NFA by making each DFA state correspond to a subset of the NFA states we could possibly be in. The one twist is that we slightly augment the rule for deciding whether a DFA state is an accept (final) state. In the traditional algorithm, a DFA state is an accept state if any of its NFA states is an accept state. We modify this to say that a DFA state is an accept state if and only if:

  • at least one of its NFA states is an accept state, and
  • none of its NFA states is an anti-accept state.

This algorithm will give us a DFA that recognizes the regular expression with lookahead. Ergo, lookahead is regular. Note that lookbehind requires a separate proof.
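To make the acceptance rule concrete, here is a small hand-built simulation of the construction. The state numbers, the alphabet {a, b}, and the example regex a(?!bb).. are my own illustrative choices, not from the original post; the regex should match three-character strings starting with a whose last two characters are not "bb".

```python
# NFA for a(?!bb).. over {a, b}, built per the construction above.
# State 0 -a-> 1.  From 1, one epsilon branch enters f(v) = aa(loop(bb)):
# 2 -b-> 3 -b-> 4, where 4 is an anti-accept state that loops on any input.
# A second epsilon branch continues with w = '..': 5 -any-> 6 -any-> 7 (accept).

EPS = None
trans = {
    (0, 'a'): {1},
    (1, EPS): {2, 5},
    (2, 'b'): {3},
    (3, 'b'): {4},
    (4, 'a'): {4}, (4, 'b'): {4},   # loop(v): once v has matched, stay anti-accepting
    (5, 'a'): {6}, (5, 'b'): {6},
    (6, 'a'): {7}, (6, 'b'): {7},
}
ACCEPT, ANTI = {7}, {4}

def eps_closure(states):
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for r in trans.get((q, EPS), ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def accepts(s):
    cur = eps_closure({0})
    for c in s:
        cur = eps_closure(set().union(*(trans.get((q, c), set()) for q in cur)))
    # Modified acceptance rule: >= 1 accept state and 0 anti-accept states.
    return bool(cur & ACCEPT) and not (cur & ANTI)
```

On "abb" the subset simulation ends in {4, 7}: an accept state is present, but so is the anti-accept state 4, so the modified rule rejects the string, exactly as a(?!bb).. should.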

Josh Haberman
In the machine you gave, you accept a, which is not in (b | a(?=.b)). Also, an anti-accept state is an accept state where a match fails? Then by definition of accept state, there are no anti-accept states! Or am I missing something?
Moron
@Moron: I think you are missing the meaning of my anti-accept states. Here is the same diagram, but with numbered states: http://imgur.com/ho4C8.png My machine does not accept `a`, because after matching `a` we can transition to states 4, 3, 1, and 5 (using the NFA->DFA algorithm). But state 5 is an anti-accept state, so following the rules at the bottom of my writeup, the DFA state corresponding to states 4, 3, 1, and 5 is *not* an accept state.
Josh Haberman
@Josh: Isn't the definition of `aa(v)` dependent on the string `s`? i.e. The set `aa(v)` can vary with `s`. Also you say that an anti-accept state starts out being an accept state. How can any match fail, then, if the machine ends up in that state? Sorry if I am reading it wrong.
Moron
@Moron: `aa(v)` just flips all accept states to be anti-accept states instead, so it should not depend on `s`. Both `v` and `aa(v)` are NFAs, not sets. I don't follow your last comment: it's true that in `v` there are accept states, but `aa(v)` does not have any accept states, and `aa(v)` is what actually ends up in the final NFA.
Josh Haberman
@Josh: Your definition: "Let aa(v) be a function on an NFA v that changes every _accept_ state into an anti-accept state..". So you change an accept state P into non-accept state, if the machine v ends at state P on some run and the match fails. But by definition (note v is still an NFA) of accepted state, if the machine ends there, the match has passed! And by set I meant, the set of states in v, you need to change into anti-accept, to make v into aa(v). Would that not depend on the string `s`? I hope I am clearer now.
Moron
@Moron: I modify the definition of an accept state in my not-quite-NFA and in the NFA->DFA algorithm (see the end of my post). By my definition, the string is accepted iff >=1 path ends in an accept state *and* 0 paths end in an anti-accept state. You can use this definition to create a traditional DFA without any crazy definitions that I made up. :)
Josh Haberman
@Moron: I'm still not seeing why you think `aa(v)` depends on `s`. `aa(v)` changes *all* accept states in `v` into anti-accept states, no more and no less.
Josh Haberman
@Josh: That is because of your crazy definition :-). A few more questions. Won't the new accept state you added correspond to an accept state in the DFA you construct, even considering anti-accept states, and would that not accept strings which should not be accepted? Also, how would you deal with nested lookaheads? Or stuff like (u(?=v)w) and (u(?=v)w)*? Sorry for the multiple questions, I am having a tough time wrapping my head around this (I blame my lack of sleep :-P).
Moron
+1 This looks clear and correct to me and is a neat way to demonstrate the equivalence.
Francis Davey
@Francis: I don't think this construction works. For instance consider u(?!r), where the empty string is in r. Then the begin state of the DFA for r is accepting and so becomes anti-accept in Josh's construction. Then the NFA->DFA construction merges the accept states of u, the new accept state added by Josh, and the begin state for r into one state (because of the e-transitions), and the resulting DFA thus has no accept states! Now how do we concat this to w to form u(?!r)w? We have no accept states to work with! If w is appended first, then depending on where we append, we will accept u(?!r)w or uw.
Moron
...Anyway, this problem has given me a headache. I think I will take a break and stop bothering you guys in the process.
Moron
@Moron: I think this behavior of `u(?!)` having no accept states is actually correct; if you try in Perl to match `a` with `a(?!)`, it does not match. Also the empty string does not match `(?!)`. This makes sense because the negative lookahead assertion (the empty string) is matched, so the string as a whole does not match. On the other hand, `a` matches `a|(?!)`, and I believe my construction handles this correctly.
Josh Haberman
@Josh: For expressions of the form u(?!r) I agree (if the empty string belongs to r), but when you start combining it, e.g. with concatenation or Kleene closure, things go haywire because of the absence of any accepting states. So your construction (as it is currently) does not really prove that an arbitrary expression with lookaheads is regular. Also, it is not clear how to extend your construction to apply to nested lookaheads etc.
Moron
How will this construction behave for `a(?!b)|ab`? I think it will end up in an aa state because of `(?!b)`, but it should match because of the alternation.
sepp2k