ansaurus

Question

Why can't Regular Expressions use keywords instead of characters?

Answer 1

+4 A:

It's Perl's fault...!

Actually, more specifically, Regular Expressions come from early Unix development, when concise syntax was valued far more highly. Storage, processing time, physical terminals, etc. were all very limited, rather unlike today.

The history of Regular Expressions on Wikipedia explains more.

There are alternatives to Regex, but I'm not sure any have really caught on.

EDIT: Corrected by John Saunders: Regular Expressions were popularised by Unix, but first implemented by the QED editor. The same design constraints applied, even more so, to earlier systems.

Colin Pickard 2009-03-10 10:20:15

Perl adopted the language and is now trying to rectify the situation by completely redesigning it for Perl 6.

Brad Gilbert 2009-03-17 23:13:19

Answer 2

+1 A:

Because the idea of regular expressions--like many things that originate from UNIX--is that they are terse, favouring brevity over readability. This is actually a good thing. I've ended up writing regular expressions (against my better judgement) that are 15 lines long. If that had a verbose syntax it wouldn't be a regex, it'd be a program.

cletus 2009-03-10 10:25:33

Answer 3

+1 A:

It's actually pretty easy to implement a "wordier" form of regex -- please see my answer here. In a nutshell: write a handful of functions that return regex strings (and take parameters if necessary).
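
A minimal sketch of that idea in Python. Every helper name here (`literal`, `one_or_more`, `capture`, `DIGIT`) is invented for illustration and is not taken from the linked answer:

```python
import re

# Hypothetical building block: a plain regex string for one digit.
DIGIT = r"\d"

def literal(text):
    """Match the given text exactly, escaping any metacharacters."""
    return re.escape(text)

def one_or_more(pattern):
    """Repeat a sub-pattern one or more times."""
    return f"(?:{pattern})+"

def capture(name, pattern):
    """Wrap a sub-pattern in a named capturing group."""
    return f"(?P<{name}>{pattern})"

# The pieces are composed by ordinary string concatenation.
ORDER_PATTERN = re.compile(
    literal("Order #") + capture("number", one_or_more(DIGIT))
)

def find_order_number(text):
    """Return the digits after 'Order #', or None if there is no match."""
    match = ORDER_PATTERN.search(text)
    return match.group("number") if match else None
```

Because each helper only returns a string, the result is still an ordinary regex and can be printed or passed to any function that accepts one.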

j_random_hacker 2009-03-10 10:26:59

Answer 4

+5 A:

Because it corresponds to formal language theory and its mathematical notation.

Yossarian 2009-03-10 10:27:46

Answer 5

+3 A:

Actually, no, the world did not begin with Unix. If you read the Wikipedia article, you'll see that

In the 1950s, mathematician Stephen Cole Kleene described these models using his mathematical notation called regular sets. The SNOBOL language was an early implementation of pattern matching, but not identical to regular expressions. Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions

John Saunders 2009-03-10 10:27:57

Answer 6

+2 A:

This is much earlier than Perl. The Wikipedia entry on Regular Expressions attributes the first implementations of regular expressions to Ken Thompson of UNIX fame, who implemented them in QED and then in the ed editor. I guess that the commands had short names for performance reasons, long before anything ran client-side. Mastering Regular Expressions is a great book about regular expressions; it describes the option of annotating a regular expression (with the /x flag) to make it easier to read and understand.
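
For illustration, Python's `re.VERBOSE` flag plays the same role as Perl's /x. A minimal sketch, in which the pattern and the `parse_time` helper are made up for this example:

```python
import re

# Verbose mode: whitespace in the pattern is ignored and '#' starts a comment.
TIME_PATTERN = re.compile(
    r"""
    (?P<hour>   [01]\d | 2[0-3] )   # hours, 00 to 23
    :                               # literal separator
    (?P<minute> [0-5]\d )           # minutes, 00 to 59
    """,
    re.VERBOSE,
)

def parse_time(text):
    """Return (hour, minute) as integers, or None if text is not HH:MM."""
    match = TIME_PATTERN.fullmatch(text)
    if match is None:
        return None
    return int(match.group("hour")), int(match.group("minute"))
```

The symbols stay the same; only the layout and the comments change, which is the middle ground between a terse one-liner and a keyword-based syntax.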

Yuval F 2009-03-10 10:29:22

Answer 7

+10 A:

Regular expressions have a mathematical (actually, language theory) background and are coded somewhat like a mathematical formula. You can define them by a set of rules, for example

every character is a regular expression, representing itself
if a and b are regular expressions, then a?, a|b and ab are regular expressions, too
...

Using a keyword-based language would be a great burden for simple regular expressions. Most of the time, you will just use a simple text string as search pattern:

grep -R 'main' *.c

Or maybe very simple patterns:

grep -c ':-[)(]' seidl.txt

Once you get used to regular expressions, this syntax is very clear and precise. In more complicated situations you will probably use something else since a large regular expression is obviously hard to read.

Ferdinand Beyer 2009-03-10 10:30:05

I love when regexes look like smileys :-[)

voyager 2009-11-16 01:40:56

Answer 8

+29 A:

You really want this?

Pattern findGamesPattern = Pattern.With.Literal(@"<div")
    .WhiteSpace.Repeat.ZeroOrMore
    .Literal(@"class=""game""").WhiteSpace.Repeat.ZeroOrMore.Literal(@"id=""")
    .NamedGroup("gameId", Pattern.With.Digit.Repeat.OneOrMore)
    .Literal(@"-game""")
    .NamedGroup("content", Pattern.With.Anything.Repeat.Lazy.ZeroOrMore)
    .Literal(@"<!--gameStatus")
    .WhiteSpace.Repeat.ZeroOrMore.Literal("=").WhiteSpace.Repeat.ZeroOrMore
    .NamedGroup("gameState", Pattern.With.Digit.Repeat.OneOrMore)
    .Literal("-->");

Ok, but it's your funeral, man.

Download the library that does this here:
http://flimflan.com/blog/ReadableRegularExpressions.aspx

Jeff Atwood 2009-03-10 10:32:26

Bah! Shameless blog marketing... for shame! :-)

cletus 2009-03-10 10:37:39

A middle ground between g/re/p and what you describe is using comments for regular expressions, as suggested in my answer. Not easier for the regex creator, but easier for the readers of the code.

Yuval F 2009-03-10 10:43:24

Thanks Jeff! At first I thought it was a very explanatory joke .... until I realized there actually was code to convert it to strings!

Jenko 2009-03-10 12:30:47

This is actually not far off of my Python module, pyparsing, I think a pyparsing version of this would be something like: `findGamesPattern = ("<div" + ZeroOrMore('class="game"') + ZeroOrMore('id=') + Word(nums)("gameId") + "-game" + SkipTo("</div>")("content") + "</div>" + "")`

Paul McGuire 2009-10-14 03:40:15

Answer 9

+6 A:

Well, if you had keywords, how would you easily differentiate them from actually matched text? How would you handle whitespace?

Source text:

Company: A Dept.: B

Standard regex:

Company:\s+(.+)\s+Dept\.:\s+(.+)

Or even:

Company: (.+) Dept\.: (.+)

Keyword regex (trying really hard not to build a strawman...):

"Company:" whitespace.oneplus group(any.oneplus) whitespace.oneplus "Dept.:" whitespace.oneplus group(any.oneplus)

Or simplified:

"Company:" space group(any.oneplus) space "Dept.:" space group(any.oneplus)

No, it's probably not better.
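
The terse form can be checked against the sample text. A minimal Python sketch, in which the dot in "Dept." is escaped so that it matches only a literal period, and the `extract` helper is a name made up here:

```python
import re

SOURCE = "Company: A Dept.: B"

# The standard pattern, with the literal dot escaped.
PATTERN = re.compile(r"Company:\s+(.+)\s+Dept\.:\s+(.+)")

def extract(text):
    """Return (company, department), or None if the text does not match."""
    match = PATTERN.search(text)
    return match.groups() if match else None
```

Calling `extract(SOURCE)` yields the two captured values; the literal text and the pattern syntax sit side by side without any quoting of the literals.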

2009-03-10 10:34:54

Answer 10

+1 A:

I don't think keywords would give any benefit. Regular expressions as such are complex but also very powerful.

What I think is more confusing is that every supporting library invents its own syntax instead of using (or extending) the classic Perl regex (e.g. \1, $1, {1}, ... for replacements and many more examples).
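
As one concrete case of that divergence, Python's `re.sub` accepts `\1` and its own `\g<1>` form in replacement strings, while `$1` is treated as literal text. A small sketch, with function names invented for the example:

```python
import re

def swap_names(text):
    """Turn 'first last' into 'last, first' using numbered backreferences."""
    return re.sub(r"(\w+) (\w+)", r"\2, \1", text)

def swap_names_explicit(text):
    """Same substitution, using Python's unambiguous \\g<n> form."""
    return re.sub(r"(\w+) (\w+)", r"\g<2>, \g<1>", text)

def dollar_is_literal(text):
    """'$1' is not a backreference in Python; it is inserted verbatim."""
    return re.sub(r"(\w+)", "$1", text)
```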

0xA3 2009-03-10 11:16:11

Answer 11

+1 A:

I know it's answering your question the wrong way around, but RegexBuddy has a feature that explains your regular expression in plain English. This might make it a bit easier to learn.

Toby Allen 2009-03-10 11:51:43

Answer 12

+7 A:

Perl 6 is taking a pretty revolutionary step forward in regex readability. Consider an address of the form: 100 E Main St Springfield MA 01234

Here's a moderately-readable Perl 5 compatible regex to parse that (many corner cases not handled):

 m/
     ([1-9]\d*)\s+
     ((?:N|S|E|W)\s+)?
     (\w+(?:\s+\w+)*)\s+
     (ave|ln|st|rd)\s+
     ([[:alpha:]]+(?:\s+[[:alpha:]]+)*)\s+
     ([A-Z]{2})\s+
     (\d{5}(?:-\d{4})?)
  /ix;

This Perl 6 regex has the same behavior:

grammar USMailAddress {
     rule  TOP { <addr> <city> <state> <zip> }

     rule  addr { <[1..9]>\d* <direction>?
                  <streetname> <streettype> }
     token direction { N | S | E | W }
     token streetname { \w+ [ \s+ \w+ ]* }
     token streettype {:i ave | ln | rd | st }
     token city { <alpha> [ \s+ <alpha> ]* }
     token state { <[A..Z]>**{2} }
     token zip { \d**{5} [ - \d**{4} ]? }
  }

A Perl 6 grammar is a class, and the tokens are all invokable methods. Use it like this:

if $addr ~~ m/^<USMailAddress::TOP>$/ {
     say "$<city>, $<state>";
}

This example comes from a talk I presented at the Frozen Perl 2009 workshop. The Rakudo implementation of Perl 6 is complete enough that this example works today.

Chris Dolan 2009-03-15 17:52:57

Answer 13

+1 A:

If the language you are using supports POSIX regexes, you can use their named character classes.

An example:

\d

would be the same as

[[:digit:]]

The bracket notation makes it much clearer what is being matched. I would still learn the "cryptic wildcard characters and symbols", since you will still see them in other people's code and need to understand them.

There are more examples in the table on regular-expressions.info's page.

gpojd 2009-03-15 18:20:11
