views:

70

answers:

3

Most UNIX regular expressions have, besides the usual *,+,? operators a backslash operator where \1,\2,... match whatever's in the last parentheses, so for example L=(a)b\1* matches the (non regular) language a^n b a^n

On one hand, this seems to be pretty powerful since you can create (a*)b\1b\1 to match the language a^n b a^n b a^n which can't even be recognized by a stack automaton. On the other hand, I'm pretty sure a^n b^n cannot be expressed this way.

Two questions: 1. Is there any literature on this family of languages (UNIX-y regular). In particular, is there a version of the pumping lemma for these? 2. Can someone prove (or perhaps disprove) that a^n b^n cannot be expressed this way?

Thanks

A: 

a^n b^n is CFL. The grammar is

A -> aAb | e

you can use pumping lemma for RL to prove A is not RL

sza
Yup, "regular" expressions have been extended to recognize languages that finite state machines will not recognize.
WhirlWind
this doesn't answer the question. he wants a statement of a theorem (similar to the pumping lemma) that he can use to prove what regular expressions can support when they support backreferences.
Ken Bloom
@ken: i believe he changed the question after I answered....
sza
@ziang. The question has never been edited.
Ken Bloom
@ken. His original question is asking why a^n b^n is CFL and how to prove it is not a RL. And WhirlWind also noticed that. You can ask Avi and you can also report to the website admin about the untracked modified thread.
sza
@ziang. He didn't. He asked you to prove whether a^n b^n can be recognized by **Extended** regexps which allow backreferences.
Ken Bloom
A: 

Ruby 1.9.1 supports the following regex:

regex = %r{ (?<foo> a\g<foo>a | b\g<foo>b | c) }x

p regex.match("aaacbbb")
# the result is #<MatchData "c" foo:"c">

PragDave has an example where he actually arranges all the parts of a regex so that it looks like a context-free grammar as follows:

sentence = %r{ 
    (?<subject>   cat   | dog   | gerbil    ){0} 
    (?<verb>      eats  | drinks| generates ){0} 
    (?<object>    water | bones | PDFs      ){0} 
    (?<adjective> big   | small | smelly    ){0} 

    (?<opt_adj>   (\g<adjective>\s)?     ){0} 

    The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object> 
}x

I think this means that at least Ruby 1.9.1's regex engine (which is the Oniguruma regex engine) is actually equivalent to a context-free grammar (though the capturing groups aren't as useful as an actual parser-generator.)

This means that the Context Free pumping lemma should describe the class of languages recognizable by Ruby 1.9.1's regex engine.

EDIT: Whoops! I messed up, and didn't do an important test which actually makes my answer above totally wrong. (I'd rather not just delete the answer, because it's useful information nonetheless.)

regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x
#I added anchors for the beginning and end of the string
regex.match("aaacbbb")
#returns nil, indicating that no match is possible with recursive capturing groups.
Ken Bloom
+1  A: 

You're probably looking for

and of course follow their citations forward and backward to find more literature on this subject.

Ken Bloom