Okay, I barely understand RegEx basics, but why couldn't they design it to use keywords (like SQL) instead of some cryptic wildcard characters and symbols?

Is it for performance since the RegEx is interpreted/parsed at runtime? (not compiled)

Or maybe for speed of writing? Considering that when you learn some "simple" character combinations it becomes easier to type 1 character instead of a keyword?

+4  A: 

It's Perl's fault...!

Actually, more specifically, Regular Expressions come from early Unix development, when concise syntax was valued much more highly. Storage, processing time, physical terminals, etc. were all very limited, rather unlike today.

The history of Regular Expressions on Wikipedia explains more.

There are alternatives to Regex, but I'm not sure any have really caught on.

EDIT: Corrected by John Saunders: Regular Expressions were popularised by Unix, but first implemented by the QED editor. The same design constraints applied, even more so, to earlier systems.

Colin Pickard
Perl adopted the language and is now trying to rectify the situation by completely redesigning it for Perl 6.
Brad Gilbert
+1  A: 

Because the idea of regular expressions--like many things that originate from UNIX--is that they are terse, favouring brevity over readability. This is actually a good thing. I've ended up writing regular expressions (against my better judgement) that are 15 lines long. If that had a verbose syntax it wouldn't be a regex, it'd be a program.

cletus
+1  A: 

It's actually pretty easy to implement a "wordier" form of regex -- please see my answer here. In a nutshell: write a handful of functions that return regex strings (and take parameters if necessary).
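A minimal sketch of that idea in Python, assuming nothing beyond the standard `re` module. The helper names (`literal`, `one_or_more`, `optional`, `named_group`) are made up for illustration and are not taken from the linked answer:

```python
import re

# Hypothetical helpers: each one just returns an ordinary regex string.
def literal(text):
    """Match text exactly, escaping any regex metacharacters."""
    return re.escape(text)

def one_or_more(pattern):
    return f"(?:{pattern})+"

def optional(pattern):
    return f"(?:{pattern})?"

def named_group(name, pattern):
    return f"(?P<{name}>{pattern})"

DIGIT = r"\d"
WHITESPACE = r"\s"

# "id", an equals sign with optional whitespace around it, then digits.
id_pattern = (
    literal("id")
    + optional(one_or_more(WHITESPACE))
    + literal("=")
    + optional(one_or_more(WHITESPACE))
    + named_group("id", one_or_more(DIGIT))
)

match = re.search(id_pattern, "user id = 42;")
print(match.group("id"))  # prints 42
```

Since the result is still a plain string, it can be handed to any function that takes a regex.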

j_random_hacker
+5  A: 

Because it corresponds to formal language theory and its mathematical notation.

Yossarian
+3  A: 

Actually, no, the world did not begin with Unix. If you read the Wikipedia article, you'll see that

In the 1950s, mathematician Stephen Cole Kleene described these models using his mathematical notation called regular sets. The SNOBOL language was an early implementation of pattern matching, but not identical to regular expressions. Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions

John Saunders
+2  A: 

This is much earlier than Perl. The Wikipedia entry on Regular Expressions attributes the first implementations of regular expressions to Ken Thompson of UNIX fame, who implemented them in QED and then in the ed editor. I guess that the commands had short names for performance reasons, long before anything ran client-side. Mastering Regular Expressions is a great book on the subject; it describes the option to annotate a regular expression (with the /x flag) to make it easier to read and understand.
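As a rough illustration of the annotation idea, here is the same pattern written compactly and then with comments. This sketch uses Python's `re.VERBOSE` flag, which plays the role of Perl's /x; the ZIP-code pattern and the variable names are only an example, not something from the book:

```python
import re

# The same pattern twice: once compact, once annotated.
compact = re.compile(r"(\d{5})(?:-(\d{4}))?")

annotated = re.compile(
    r"""
    (\d{5})            # five-digit ZIP code
    (?: - (\d{4}) )?   # optional hyphen and four-digit extension
    """,
    re.VERBOSE,        # ignore whitespace and allow # comments in the pattern
)

sample = "01234-5678"
print(compact.fullmatch(sample).groups())
print(annotated.fullmatch(sample).groups())
```

Both calls return the same groups; only the readability of the source differs.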

Yuval F
+10  A: 

Regular expressions have a mathematical (actually, language theory) background and are coded somewhat like a mathematical formula. You can define them by a set of rules, for example

  • every character is a regular expression, representing itself
  • if a and b are regular expressions, then a?, a|b and ab are regular expressions, too
  • ...
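To make the inductive flavour of those rules concrete, here is a small Python sketch that models only the rules listed above (a single character, a?, a|b and ab) as a tree and turns the tree into an ordinary regex string. The class and function names are invented for this example:

```python
import re
from dataclasses import dataclass

@dataclass
class Char:   # a single character, representing itself
    c: str

@dataclass
class Opt:    # a?
    a: object

@dataclass
class Alt:    # a|b
    a: object
    b: object

@dataclass
class Cat:    # ab
    a: object
    b: object

def to_regex(node):
    """Translate the tree into a regex string, one rule per node type."""
    if isinstance(node, Char):
        return re.escape(node.c)
    if isinstance(node, Opt):
        return f"(?:{to_regex(node.a)})?"
    if isinstance(node, Alt):
        return f"(?:{to_regex(node.a)}|{to_regex(node.b)})"
    if isinstance(node, Cat):
        return to_regex(node.a) + to_regex(node.b)
    raise TypeError(f"unknown node: {node!r}")

# "a" followed by an optional "b" or "c"
pattern = to_regex(Cat(Char("a"), Opt(Alt(Char("b"), Char("c")))))
print(pattern)
print(bool(re.fullmatch(pattern, "ac")))
```

Every regex built this way is itself a valid input to the same constructors, which is exactly what the rules say.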

Using a keyword-based language would be a great burden for simple regular expressions. Most of the time, you will just use a simple text string as search pattern:

grep -R 'main' *.c

Or maybe very simple patterns:

grep -c ':-[)(]' seidl.txt

Once you get used to regular expressions, this syntax is very clear and precise. In more complicated situations you will probably use something else since a large regular expression is obviously hard to read.

Ferdinand Beyer
I love when regexes look like smileys :-[)
voyager
+29  A: 

You really want this?

Pattern findGamesPattern = Pattern.With.Literal(@"<div")
    .WhiteSpace.Repeat.ZeroOrMore
    .Literal(@"class=""game""").WhiteSpace.Repeat.ZeroOrMore.Literal(@"id=""")
    .NamedGroup("gameId", Pattern.With.Digit.Repeat.OneOrMore)
    .Literal(@"-game""")
    .NamedGroup("content", Pattern.With.Anything.Repeat.Lazy.ZeroOrMore)
    .Literal(@"<!--gameStatus")
    .WhiteSpace.Repeat.ZeroOrMore.Literal("=").WhiteSpace.Repeat.ZeroOrMore
    .NamedGroup("gameState", Pattern.With.Digit.Repeat.OneOrMore)
    .Literal("-->");

Ok, but it's your funeral, man.

Download the library that does this here:
http://flimflan.com/blog/ReadableRegularExpressions.aspx

Jeff Atwood
Bah! Shameless blog marketing... for shame! :-)
cletus
A middle ground between g/re/p and what you describe is using comments for regular expressions, as suggested in my answer. Not easier for the regex creator, but easier for the readers of the code.
Yuval F
Thanks Jeff! At first I thought it was a very explanatory joke .... until I realized there actually was code to convert it to strings!
Jenko
This is actually not far off from my Python module, pyparsing. I think a pyparsing version of this would be something like: `findGamesPattern = ("<div" + ZeroOrMore('class="game"') + ZeroOrMore('id=') + Word(nums)("gameId") + "-game" + SkipTo("</div>")("content") + "</div>" + "<!--gameStatus" + Word("=") + Word(nums)("gameState") + "-->")`
Paul McGuire
+6  A: 

Well, if you had keywords, how would you easily differentiate them from actually matched text? How would you handle whitespace?

Source text:

Company: A Dept.: B

Standard regex:

Company:\s+(.+)\s+Dept\.:\s+(.+)

Or even:

Company: (.+) Dept\.: (.+)

Keyword regex (trying really hard not to build a strawman...):

"Company:" whitespace.oneplus group(any.oneplus) whitespace.oneplus "Dept.:" whitespace.oneplus group(any.oneplus)

Or simplified:

"Company:" space group(any.oneplus) space "Dept.:" space group(any.oneplus)

No, it's probably not better.

+1  A: 

I don't think keywords would give any benefit. Regular expressions as such are complex but also very powerful.

What I think is more confusing is that every supporting library invents its own syntax instead of using (or extending) the classic Perl regex (e.g. \1, $1, {1}, ... for replacements and many more examples).
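A quick Python illustration of how the replacement syntax differs from library to library. The date example and variable names are invented; the behaviour shown is that of the standard `re` module, which understands `\1` and `\g<1>` but treats `$1` as plain text:

```python
import re

text = "2009-02-14"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# Backslash-style group references work in the replacement string.
swapped = re.sub(pattern, r"\3/\2/\1", text)

# \g<n> is the explicit form; it stays unambiguous when a digit follows.
explicit = re.sub(pattern, r"\g<3>/\g<2>/\g<1>", text)

# Perl-style $1 is not a group reference in Python; it is inserted literally.
dollar = re.sub(pattern, "$1", text)

print(swapped)   # prints 14/02/2009
print(explicit)  # prints 14/02/2009
print(dollar)    # prints $1
```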

0xA3
+1  A: 

I know it's answering your question the wrong way around, but RegexBuddy has a feature that explains your regular expression in plain English. This might make it a bit easier to learn.

Toby Allen
+7  A: 

Perl 6 is taking a pretty revolutionary step forward in regex readability. Consider an address of the form: 100 E Main St Springfield MA 01234

Here's a moderately-readable Perl 5 compatible regex to parse that (many corner cases not handled):

 m/
     ([1-9]\d*)\s+
     ((?:N|S|E|W)\s+)?
     (\w+(?:\s+\w+)*)\s+
     (ave|ln|st|rd)\s+
     ([[:alpha:]]+(?:\s+[[:alpha:]]+)*)\s+
     ([A-Z]{2})\s+
     (\d{5}(?:-\d{4})?)
  /ix;

This Perl 6 regex has the same behavior:

grammar USMailAddress {
     rule  TOP { <addr> <city> <state> <zip> }

     rule  addr { <[1..9]>\d* <direction>?
                  <streetname> <streettype> }
     token direction { N | S | E | W }
     token streetname { \w+ [ \s+ \w+ ]* }
     token streettype {:i ave | ln | rd | st }
     token city { <alpha>+ [ \s+ <alpha>+ ]* }
     token state { <[A..Z]>**{2} }
     token zip { \d**{5} [ - \d**{4} ]? }
  }

A Perl 6 grammar is a class, and the tokens are all invokable methods. Use it like this:

if $addr ~~ m/^<USMailAddress::TOP>$/ {
     say "$<city>, $<state>";
}

This example comes from a talk I presented at the Frozen Perl 2009 workshop. The Rakudo implementation of Perl 6 is complete enough that this example works today.

Chris Dolan
+1  A: 

If the language you are using supports POSIX regexes, you can use their named character classes.

An example:

\d

would be the same as

[[:digit:]]

The bracket notation is much clearer about what it is matching. I would still learn the "cryptic wildcard characters and symbols", since you will still see them in other people's code and need to understand them.

There are more examples in the table on regular-expressions.info's page.

gpojd