views:

285

answers:

5

This is more of a computer science question than a programming one, but I figure that this is the best place out of all the related sites to ask this.

When I discovered Regular Expressions and looked up the term I assumed that this property of "regularity" refers to the fact that the expression's language has a definable structural pattern. However, in reading about the subject and the theory behind this I learned that there are kinds of languages that are not regular, and yet from the way they are defined it's clear that a pattern can be matched to them. One such language is (a^n)(b^n). Clearly this is a pattern, and yet this is not a regular language. So now I'm left wondering what is it about regular languages that makes them regular, and this language not?

+4  A: 

The etymology of the name comes from Kleene's 1950s work describing regular sets using his mathematical notation created for the purpose. See this.

wallyk
@Barry Kelly: thanks for the typo fix. I had meant to go back and check the word.
wallyk
A: 

The word regular in regular expression refers to the Mathematical concept of regular, not the English concept. Just like how the word prime in mathematics bear little relation to prime beef.

It's inherited by CS (which is a branch of mathematics) to refer to a more specific concept: http://en.wikipedia.org/wiki/Regular_language

slebetman
A: 

regular expression are not really regular, the name is etymological.

Paul Creasey
Regexp IS regular but regex is not. Specifically, regex is what Perl calls its regexp-like syntax to distinguish it from traditional regexp. There are languages out there that still implement truly regular regexp: tcl and awk to name two.
slebetman
+1  A: 

Perhaps the Wikipedia article on regular languages can explain it better than we can. However, I'll give it a shot.

From a theoretical standpoint, a regular language (set of strings) is one that can be generated using a finite state automaton. In programmer terms, this is equivalent to saying it can be generated using regular expressions. Thus, all finite languages (sets of strings) are regular, but there are some infinite languages, such as anbn (the language of all strings of n a's followed by n b's) that cannot be recognized using a FSA or regular expressions. There are more powerful computational devices (such as modern computers, which are modeled using Turing Machines) which can recognize those languages.

The reason regular expressions are used so much in programming for string searching is that they can recognize the large majority of strings that are important to us programmers, and at the same time can be implemented to search very quickly using finite state automata.

BlueRaja - Danny Pflughoeft
Wrong. Programmers' regular expressions are usually **not** the way to define regular languages. RegExps are more generic (as they can recognize all the regular languages and many other languages).
Bruno Reis
What? Give me an example of a language which can be recognized by programmers' regexps but not theoretical regular expressions.
BlueRaja - Danny Pflughoeft
Not all regexp are regex. Some languages implement truly regular regexp instead of a clone of Perl's regex.
slebetman
See the 4th bullet on http://en.wikipedia.org/wiki/Regular_language . If you want more details, you should take and study any good book on formal language theory.
Bruno Reis
@BlueRaja: Perl's `/(.*)\1/` to match repeated words cannot be recognized by a truly regular language.
slebetman
Oh, a simple example: the language defined by (.*)\1, it is not regular, although I've just defined it through a regular expression. The problem is the back reference.
Bruno Reis
@slebetman: Are those non-regular expressions Perl-only? I've done regex in a number of languages, and I don't recognize that syntax.
BlueRaja - Danny Pflughoeft
@BlueRaja: those are extensions to the concept of "Regular Expression" that are implemented by some programming languages (including Perl)
Bruno Reis
Not sure why the -1 - though there are (apparently) extensions to regex to make it more powerful, this answer is still technically correct..
BlueRaja - Danny Pflughoeft
+10  A: 

Intuitively explaining computer science is... tricky. I'll give it a shot, but keep in mind that some of this is going to be "close enough" but not theoretically rigorous.

A regular language is one that can be decided by a machine that is computational equivalent to a finite automata (DFA/NDFA). A finite automata can be thought of as a machine that operates purely in states, no storage. So you can see that anbn cannot be regular as it requires a machine that can count the number of a's and b's (and thus must have infinite* storage capacity) in order to compare them.

For comparison, (abc)n is regular, because the number of repetitions is irrelevant.

For a more rigorous (and correspondingly denser view) check the wikipedia article and linked pages.

*The infinite doesn't matter here, but I mention it for completeness. It might be easier to think of it as "luckily, always just enough" storage.

Kevin Montrose
+1 for the "states, no storage" comment, I forgot to mention that.
BlueRaja - Danny Pflughoeft
I find its easiest to think like so: DFA/regular -> no storage, PDA/CFL -> infinite storage w/ restricted access, TM -> infinite storage w/ random access
Kevin Montrose