views:

410

answers:

3

Over the years, "regex" pattern matching has been getting more and more powerful to the point where I wonder: is it really just context-sensitive-grammar matching? Is it a variation/extension of context-free-grammar matching? Where is it right now and why don't we just call it that instead of the old, restrictive "regular expression"?

+6  A: 

In particular backreferences to capturing parentheses make regular expressions more complex than regular, context-free, or context-sensitive grammars. The name is simply historically grown (as many words). See also this section in Wikipedia and this explanation with an example from Perl.

Fabian Steeg
Could you please explain the difference between `regular language` and `regular expression`?
SealedSun
Is it really more powerful than CSG? Could you give an example?
notnot
A regular language can be described by a regular grammar (see http://en.wikipedia.org/wiki/Regular_grammar), while regular expressions are a pattern matching language that is less restricted and therefore more complex to process.
Fabian Steeg
Thanks for your comment notnot, I've added a link to a sample and some details.
Fabian Steeg
Hmmm... does this mean that we can match against arbitrary CSG's using today's tools.
notnot
Hm, not really sure what you are asking but I believe no, it simply means some regular expressions can't even be described with a CSG and so can't be parsed by a linear bounded automaton.
Fabian Steeg
Where are the boundary lines, then, on our regex tools? How can we write expressions that defy the limits of CSG's and yet not be able to express all CSG's?
notnot
I don't know, I just meant to say that the fact that some things work that cannot be expressed by a CFG does not necessarily imply that all things that can be will work too.
Fabian Steeg
Hey, I like the idea of defining "regex" as distinct from "regular expressions", and not just a contraction. Problem is, it still looks like a contraction.
13ren
+3  A: 

The way I see it:

  • Regular languages:
    • Matched by state machines. Only one variable can be used to represent the current "location" in the grammar to be matched: Recursion cannot be implemented
  • Context-free languages:
    • Matched by a stack machine. The current "location" in the grammar is represented by a stack in one or another form. Cannot "remember" anything that occurred before
  • Context-sensitive languages:
    • Most programming languages
    • All Most human languages

I do know of regular expression parsers that allow you to match against something the parser has already encountered, achieving something like a context-sensitive grammar.

Still, regular expression parsers, however sophisticated they may be, don't allow for recursive application of rules, which is a definite requirement for context-free grammars.

The term regex, in my opinion, mostly refers to the syntax used to express those regular grammars (the stars and question marks).

SealedSun
Lookahead/lookbehind and naming definitely add something that sits outside of standard regular expressions - memory. So aren't we at PDA level?
notnot
It's not in general true that natural language is context-sensitive, see http://www.eecs.harvard.edu/~shieber/Biblio/Papers/shieber85.pdf
Fabian Steeg
ah, that's the good stuff
notnot
Fascinating, thx for the link.
SealedSun
+1  A: 

There are features in modern regular expression implementations that break the rules of the classic regular expression definition.

For example Microsoft’s .NET Balancing Group (?<name1-name2> … ):

^(?:0(?<L>)|1(?<-L>))*(?(L)(?!))$

This does match the language L₀₁ = {ε, 01, 0011, 000111, … }. But this language is not regular according to the Pumping Lemma.

Gumbo
I know that it goes beyond classic regex, but I'm wondering how much further. Fabian's link above is interesting.
notnot