tags:

views:

3615

answers:

32

Often when I see regular expressions, I only see a total mess of characters. Why does it have to be this way?

I guess what I really want to know is: are there alternatives to regular expressions that basically do the same thing but are implemented in a human readable language?

[UPDATE]

Thanks for all the great responses and inspiration!

I wanted to highlight this particular link, which shows what a (working) alternative would look like and may also be a good starting point for learning, or for "simple" regular expressions. But you also quickly get a feel for the verbosity tradeoff.

+4  A: 

Write a lot of code.

kenny
To do the same thing in a human readable language requires you to "Write a lot of code."
Gavin Miller
@LFSR Consulting, thanks for the interpretation, or I wouldn't be able to vote up
Nathan Fellman
+1  A: 

I think if you use them enough, you get used to them. I'm now at the point where I can look at a simple regex and see what it does. But the more complicated ones are still beyond me. Given enough time and projects that use them, I think I'll get better at reading them and quickly understanding what they do.

Thomas Owens
+35  A: 

Regular expressions are complex because they need to be able to express regular languages, which can be complex.

If you want something simpler, don't overlook the simple string manipulation options provided by your language. You can accomplish a lot with indexof, substring, and replace type methods.
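As a rough illustration (Python here, with a made-up input line; the variable names are only for the example), pulling a value out of a string needs nothing more than find, slicing, and replace:

```python
# Hypothetical input: a "key=value; key=value" style line.
line = "user=alice; role=admin"

# "indexof": locate the delimiters.
start = line.find("role=") + len("role=")
end = line.find(";", start)          # -1 when there is no trailing ';'

# "substring": slice between them.
role = line[start:] if end == -1 else line[start:end]

# "replace": plain text substitution, no pattern syntax involved.
masked = line.replace("alice", "***")
```

This only works while the format stays this predictable; once the delimiters start varying, a pattern language begins to pay for itself.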

Corbin March
Actually, most regex engines have gone far beyond regular languages, as pretty much every second question tagged with regex asks for balanced matching etc.
Torsten Marek
+5  A: 

If you want to start seeing the regular expressions: Mastering Regular Expressions By Jeffrey Friedl is the book you want.

Regular expressions are a language to explain a language (i.e., a meta-language); thus they are inherently complicated.

If you want some background on why they're complicated, you can look at Finite State Machines.

Gavin Miller
Technically, they do not define languages, which can be recursive (although there are recursive extensions).
Ben Doom
+14  A: 

Just like most programming tools, they aren't really complex; they just look scary when they aren't spaced properly. Big regexes can generally be broken down into manageable chunks to make them readable; people just don't bother, since it's easier to put everything on one line and hope it doesn't have any bugs in it.

EDIT: Just to make it more clear I'm going to add an example from some code I'm working on:
Ugly:

.*(\<FIXEDDD\>.*?)\<HH([\+\-]?\d+)?\>(.*)(\n.*)?
Better:
.*
(\<FIXEDDD\>.*?)
\<HH
([\+\-]?\d+)?
\>
(.*)
(\n.*)?
It's still not perfect, and could use some comments, but the second version at least allows you to see what clauses will be captured by the regex, and hopefully gives you a bit of a better idea of what's going on.
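The same clause-per-line layout is available in any host language that joins adjacent string literals. A minimal sketch in Python, assuming an ISO-style date as the thing to match (the pattern and the names are illustrative, not taken from the code above):

```python
import re

# Adjacent string literals are joined by Python before the regex engine
# ever sees them, so the layout and comments cost nothing at run time.
DATE = re.compile(
    r"(\d{4})"   # group 1: year
    r"-"
    r"(\d{2})"   # group 2: month
    r"-"
    r"(\d{2})"   # group 3: day
)

m = DATE.search("released on 2008-10-17, patched later")
year, month, day = m.groups()
```

The compiled pattern is exactly the one-line version; only the source layout differs.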

tloach
I didn't know you could "space" regexes ... if that's the case, this is sort of like programming assembler functions each on just one line.
steffenj
@steffenj: The regex is a string, you can build it such that there is whitespace to show the various clauses, even though the space can't be in the string itself.
tloach
Most regular expression flavours support flags that ignore any white space in the regular expression, and allow you to place actual comments within the regular expression.
Kibbee
The original "ugly" version is more readable, IMHO. It's certainly what I'm used to. When I first read the multiline version, I interpreted the backslashes as line continuation markers before realizing they weren't on every line.
rmeador
In Perl you would use the `/x` modifier. If you do, make sure you escape any significant whitespace, i.e. replace ` *` with `\ *` or, even better, `[ ]*`, which are now equivalent in Perl 5.10.
Brad Gilbert
@Brad or even better, use \s
Nathan Fellman
+44  A: 

If you learn the regular expression language, it doesn't look like a big mess (or, at least, no more of a mess than any other language that you don't know). Making it "human readable" wouldn't be much better since you'd have to express complex ideas and relationships as long chains of words, which would be another sort of mess. The key, as with any language, is to practice it.

In Perl, (and maybe some other languages), you can write regular expressions with embedded comments and insignificant whitespace using the /x modifier to make them somewhat easier to read. This silly example demonstrates the idea:

$string =~ m/
     ^      #beginning of internal line with /m
     foo           # literal foo
     \s+           # any whitespace
     (ba           # remember in $1, starts with ba
       (?:           # non-memory grouping
          r|z           # r or z, doesn't matter which
       )              # end non-memory grouping
     )             # end $1
     /xm;
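
For readers outside Perl, here is roughly the same silly example in Python, where re.VERBOSE plays the role of /x and re.MULTILINE the role of /m. It is a sketch of the idea rather than an exact port ($1 becomes group 1):

```python
import re

pattern = re.compile(
    r"""
    ^          # beginning of a line (because of re.MULTILINE)
    foo        # literal foo
    \s+        # one or more whitespace characters
    (ba        # group 1, starts with ba
      (?:      # non-capturing group
        r|z    # r or z, doesn't matter which
      )        # end non-capturing group
    )          # end group 1
    """,
    re.VERBOSE | re.MULTILINE,
)

m = pattern.search("first line\nfoo   baz")
```

Here m.group(1) holds the text captured by the outer parentheses.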

On the other side, there are tools, such as YAPE::Regex that explain a regex's function. That's not especially useful because these tools can't explain the regex's intent or broader purpose.

If you want to learn more about them in Perl, see my books Learning Perl or Mastering Perl, and if you want to just learn about regular expressions, try Mastering Regular Expressions by Jeffrey Friedl.

brian d foy
True. But before I start using abbreviations, I appreciate verbosity. I would think that it might be helpful to learn regex via expressions that are understandable, like "find first of" or "replace every occurrence of this with that".
steffenj
I agree with you there.. But maybe we just need to be taught by example so that these understandable things can be translated into meaningful events
CheeseConQueso
"If you learn the regular expression language, it doesn't look like a big mess (or, at least, no more of a mess than any other language that you don't know)."The difference is that in normal languages you can have english words describing what's going on at various points, in the form of variable names and other identifiers. And for me that gets to the heart of the 'problem' with regex.
In Python you can use the re.X modifier to do the same thing. http://docs.python.org/library/re.html#re.X
Craig McQueen
+1  A: 

Here are a couple of links you might find useful:

RegEx Builder in code

Or Include Comments as you 'build' the expression.

Brian Schmitt
+5  A: 

If you stick to the fundamental regular expression operators, Kleene's star (*), alternatives (|) and concatenation, you'll find the syntax is both succinct and clear.

The problem you experience is due to all the bells and whistles.

However, there are alternative syntaxes available. One such example is SRE (Scheme Regular Expressions). As you probably have guessed, the operators are now in prefix position instead of infix. To see these expressions in action, skip to the section called "Short Tutorial" at http://www.scsh.net/docu/post/sre.html

soegaard
+15  A: 

They are cryptic because they are a DSL (Domain Specific Language) and they need to convey a lot of information in a few characters. It wouldn't be practical to use a long notation, e.g.:

match any char; match "a"; match any digit; endOfLine

This kind of program would look strange and long compared to the notations used today.
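
To make the comparison concrete, here is a toy translator for that long notation, sketched in Python. The step vocabulary is invented for this example (it is not an existing library), and a quoted literal containing a semicolon would confuse the naive split:

```python
import re

# Hypothetical vocabulary: each verbose step maps to a regex fragment.
STEPS = {
    "match any char": ".",
    "match any digit": r"\d",
    "endOfLine": "$",
}

def translate(program: str) -> str:
    """Turn 'step; step; ...' into an ordinary regex string."""
    parts = []
    for step in (s.strip() for s in program.split(";")):
        if not step:
            continue
        if step in STEPS:
            parts.append(STEPS[step])
        elif step.startswith('match "') and step.endswith('"'):
            # A quoted literal: escape it so metacharacters stay literal.
            parts.append(re.escape(step[len('match "'):-1]))
        else:
            raise ValueError(f"unknown step: {step!r}")
    return "".join(parts)

regex = translate('match any char; match "a"; match any digit; endOfLine')
```

The 53-character program collapses to the five-character regex .a\d$, which is the whole tradeoff in miniature.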

Cristian Ciupitu
To be honest I kind of like that syntax, it sort of reminds me of SQL in some way. Maybe regex could benefit from using modern SQL/LINQ-like syntax.
Robert Gould
To be honest I kind of hate that syntax, it sort of reminds me of SQL in some way. ;) I agree that in theory it appears good, but in practice it would get unwieldy, especially for the complex regexes used in the real world.
sundar
+11  A: 
Kent Fredric
"I didn't know you could "space" regexes" -> in Perl, use the /x modifier
bart
@bart: yeah, I know, but I always found that specific notation style a bit fugly. ( and doesn't promote re-use of code-units ;) )
Kent Fredric
I don't think the verbose version is awful, it's actually rather nice and clear, and understandable, and maintainable. Lots of good things Regex aren't
Robert Gould
Using those variables helps to keep track of the regexp, which is good. So I would recommend using that kind of variable scheme with regexps for maintenance reasons. Though you can wrap the variables in order to hide them.
Silvercode
+6  A: 

Don't forget that old adage:

"Try to solve a problem with a regular expression, and now you've got two problems."

I believe the exact quote is "Some people, when confronted with a problem, think `I know, I’ll use regular expressions.' Now they have two problems." (by Jamie Zawinski).
Cristian Ciupitu
I would very much like to forget that old adage but I can't, because people keep quoting it whenever a regex question comes up. :D
Alan Moore
I don't know why this is getting upvoted at all, this doesn't even come close to answering the question.
sundar
+5  A: 

Regular expressions don't have to look complicated.

Check this out: http://www.regular-expressions.info/comments.html

It shows how to include comments in regular expressions and format them to be readable.

BoltBait
+11  A: 

Look at the bright side, they are no more cryptic to a person that doesn't know the regex language, than Chinese is to a person that doesn't know that language.

The point is that they are cryptic looking because their goal is to express very complex behavior in a very compact pattern.

Robert Walker
+2  A: 

One reason is that there is a lot of matching on things you can't see, so that requires a representation for newlines, tabs, and spaces. Secondly, you have the meta-character problem, so you end up escaping many characters. For instance, how do you recognize brackets? Hint: your characters to match...go inside brackets!
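
A small Python sketch of both problems, with made-up sample strings: brackets matched inside brackets, re.escape for text that should be taken literally, and written forms for invisible characters.

```python
import re

# Inside a character class, ']' must be escaped (or placed first);
# escaping '[' as well keeps the intent obvious.
brackets = re.compile(r"[\[\]]")
found = brackets.findall("a[0] = b[1]")

# re.escape does the escaping for you when the text to match is data.
literal = re.escape("[x]")
exact = re.search(literal, "see [x] here")

# Invisible characters need a written form too.
tabs_or_newlines = re.findall(r"[\t\n]", "a\tb\nc")
```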

gbarry
A: 

It seems like by the time you've figured out the syntax, unless you are a regular regular-expression user, you could have written a more verbose, human-friendly equivalent half an hour ago, such as:

"(any char(except numbers)) whitespace(unlimited) (eol)"

Maybe someone has come up with an alternative like this. A lexical parser that just translates it into a regex would be fun to write (or at least I would find it so).

Chris S
There's this, in fact: http://weblogs.asp.net/rosherove/archive/2008/05/06/introducing-linq-to-regex.aspx
Chris S
real regex people laugh at you for taking so much to type '[^\d]\s*$'
Kent Fredric
In the same way Perl programmers laugh at C programmers for not having $_ syntax for arrays? ;) At 70 words per minute I'm sure it didn't take that much longer to type; that was an over-simplified example, however.
Chris S
+1  A: 

Regular expressions look messy because the language is inherently dense. In addition, the use of punctuation creates a certain amount of cognitive dissonance.

plinth
+9  A: 

If you really think regular expressions are complicated, check out this code for parsing a very basic syntax:

/* Forward declarations: match() calls the helpers before they are defined. */
int matchhere(char *regexp, char *text);
int matchstar(int c, char *regexp, char *text);

/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere(regexp+1, text+1);
    return 0;
}

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

Understand that, and you'll understand the beauty of regular expressions. I believe it's by Kernighan.

Bobby Jack
+2  A: 

CPAN's Regexp::English is an implementation of regular expressions in English. It's verbose, and I don't think it does a better job of rendering an expression's intent than the traditional cryptic syntax, but it shows what you could do. Examples from the doc include:

my $re = Regexp::English->new()
 ->group()
  ->digit()
  ->or()
  ->word_char();

and

my $re = Regexp::English->new()
 ->remember()
   ->literal('root beer')
  ->or
   ->literal('milkshake')
 ->end();
Ross Patterson
+19  A: 

Regular expressions are a language written for a specific domain. Like many computer languages, regular expression code looks awful to people who haven't programmed in the language.

The major complications in reading regular expressions:

  1. The language is terse. (\d\d?)/(\d\d?)/(\d\d)?(\d\d) can look fairly cryptic. Your reading speed needs to drop drastically when reading a regular expression.
  2. The language varies slightly in the implementations in C, C++, C#, Perl, Python, and Ruby. These variations tend to be small extensions.
  3. Regular expressions are often tasked to do too much because there is no equivalent simple, embeddable parsing language. Regular expressions were designed to parse tokens. A recursive descent parser is more suitable to a small grammar. The temptation of writing the whole grammar as a large regular expression can be overwhelming.

To date, no one has found a better syntax for describing parsing characters into tokens. Many have tried. Use a tool for debugging, parsing and testing regular expressions.
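
A minimal sketch of that division of labour, in Python: a regular expression splits the input into tokens, and a small recursive descent parser handles the nesting. The grammar (integer arithmetic with parentheses) and all names are chosen for this example; the tokenizer silently skips characters it does not recognise, and division is integer division to keep things short.

```python
import re

# The regex only describes tokens: integers, operators, parentheses.
TOKEN = re.compile(r"\d+|[-+*/()]")

def tokenize(text):
    return TOKEN.findall(text)

def parse(tokens):
    """Evaluate expr := term (('+'|'-') term)*, by recursive descent."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        value = term()
        while peek() in ("+", "-"):
            if take() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():
        value = factor()
        while peek() in ("*", "/"):
            if take() == "*":
                value *= factor()
            else:
                value //= factor()
        return value

    def factor():
        tok = take()
        if tok == "(":
            value = expr()      # recursion handles arbitrary nesting
            if take() != ")":
                raise SyntaxError("expected ')'")
            return value
        return int(tok)

    result = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing tokens")
    return result

answer = parse(tokenize("2 * (3 + 4) - 5"))
```

The nesting lives in the parser, where it is ordinary recursion, instead of being forced into one large pattern.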

Charles Merriam
FYI: Digits are \d not /d.
R. Bemrose
In fact, regular expressions cannot describe many interesting languages.
Tetha
"To date, no one has found a better syntax for describing parsing characters into tokens." I take it that SNOBOL4's way doesn't count?
boost
+2  A: 

Jeff Atwood swears by RegexBuddy to make it simpler. I haven't used it, but it looks great.

Nathan Long
I swear by RegexBuddy too. Worth every dime I paid for it..
Scott Evernden
A: 

I admit too they were difficult to start with, but practice certainly does help you!

Check out these online Regex testers, very useful and great to start to learn with.

http://osteele.com/tools/rework/#

http://www.techeden.com/regex

Good luck!

alex
+1  A: 

Other good points are made at this question (sorry that it's mine, but I think it adds to this).

Robert Gould
+5  A: 

They don't have to be, but the folks who (over time) created the regular expression language decided to make it as compact as possible, using the symbols on the keyboard as the meta-symbols in the language. It just so happens that they chose a lot of the cryptic-looking symbols like *, $, +, ^, etc. Each of those symbols stands for an action or a concept in the language.

They certainly could have made it more verbose. Here's a fictional regex-like language that matches US ZIP codes.

(char-class '0' '9') repeat 5
( literal '-'
  (char-class '0' '9') repeat 4
) optional

I made up that language myself, using names and symbols that made sense to me. We could take it a step further and try to cut down on the verbosity by using single-character symbols for the operators. Let's replace char-class with the brackets []; repeat with braces {}, and optional with a question-mark ?. Literal values can stand on their own without any surrounding symbols.

[0-9]{5}
(
  -
  [0-9]{4}
)?

Finally, we'll just eliminate the extra whitespace.

[0-9]{5}(-[0-9]{4})?

Doesn't look so scary anymore, now that we know what each of the symbols represents.
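
If you want to try both forms, the spaced-out version and the compact version can be checked against each other. A short Python sketch, assuming re.VERBOSE as the "ignore whitespace" switch and a few sample strings invented for the test:

```python
import re

# Spaced-out form: re.VERBOSE ignores the whitespace and the comments.
ZIP_VERBOSE = re.compile(
    r"""
    [0-9]{5}        # five digits
    (               # optional ZIP+4 suffix:
      -             #   a literal dash
      [0-9]{4}      #   four more digits
    )?
    """,
    re.VERBOSE,
)

# Compact form from the last step.
ZIP_COMPACT = re.compile(r"[0-9]{5}(-[0-9]{4})?")

samples = ["92345", "92345-6789", "9234", "92345-67"]
verbose_hits = [bool(ZIP_VERBOSE.fullmatch(s)) for s in samples]
compact_hits = [bool(ZIP_COMPACT.fullmatch(s)) for s in samples]
```

Both lists come out the same, since the two patterns differ only in layout.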

Barry Brown
Well explained. It helped to answer a question I had as well.
canadiancreed
+1  A: 

Backus and Naur designed a meta-language (BNF) which could be used in describing or defining various kinds of languages. BNF can be used in describing any Type 3 language. You can use whatever names you like for nonterminals. Therefore you can make a human readable definition for any Type 3 language.

Now, it's common to use BNF for more complicated languages such as programming languages that need more complicated parsers. Yacc reads an input language that is based on BNF. But if you use BNF and a Yacc-like parser to process Type 3 languages, you waste a lot of CPU time.

Regular expressions are less powerful than BNF. Originally regular expressions could only describe Type 3 languages but not any more complicated languages. Corresponding to that kind of simplicity, Lex is faster than Yacc. Text editors like QED and vi could accept regular expressions in a single line, to search for matching strings. So from the point of view of computers and human users of text editors, regular expressions are simpler than BNF.

But yeah, regular expressions aren't easy to read. If you want to use BNF, use BNF.

Windows programmer
A: 

In a way, it is a valid question. I learned REs quite recently, partly because they weren't available (natively) in my early languages (Basic, assembly, C...), partly because I found them overly complex.

Once I dived in and learned them properly, I found them powerful, useful, and not so complex once the eye is used to the syntax. Breaking complex expressions out into a multi-line, indented layout might help too.

They are complex because they use plain ASCII, making it necessary to escape plain characters. It is even messier in languages like Java, which have no way to disable escaping, so you need to double all backslashes.

But compactness is useful in its way: you would probably prefer /^\d{1,4}-\w+,?\d{3}$/ to several lines of parsing code. If you use POSIX syntax for character classes, it might be more readable too.
The advantages of compactness can be argued (compare Cobol, close to plain English, with APL, full of cryptic symbols), but now that regexes are well established in the programming world, it is a bit hard to change...

I would add that most REs I use are quite simple, like the one above (you might find it not so simple... :-)). Newbies tend to complicate things by over-escaping and ignoring shortcuts: /^[0-9][0-9]?[0-9]?[0-9]?\-[A-Za-z0-9_]+\,?[0-9][0-9][0-9]$/ for the above expression isn't uncommon... (primitive regex implementations don't help...).
If you use excessively complex REs, with a lot of pipes/alternatives, lookaround, etc., you might want to use broader REs and code to handle sub-cases: it is easier to write, understand and maintain, and might be much faster...
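
As a quick check that the short and the long spelling of that expression accept the same strings, here is a Python sketch. The sample strings are invented, and re.ASCII is an assumption added so that \d and \w mean exactly [0-9] and [A-Za-z0-9_] (in Python 3 they would otherwise also match non-ASCII digits and letters):

```python
import re

SHORT = re.compile(r"^\d{1,4}-\w+,?\d{3}$", re.ASCII)
LONG = re.compile(
    r"^[0-9][0-9]?[0-9]?[0-9]?\-[A-Za-z0-9_]+\,?[0-9][0-9][0-9]$"
)

samples = ["12-abc,345", "1234-x_9999", "12345-abc345", "12-abc,34"]
agree = all(bool(SHORT.match(s)) == bool(LONG.match(s)) for s in samples)
```

Agreement on a handful of samples is not a proof of equivalence, but it is a cheap sanity test when simplifying someone else's pattern.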

There are not many alternatives, partly because there are so many REs ready to use...

Something to explore is parsing expression grammars (PEGs), which are more powerful (context, nesting...) than REs and might be more readable, by using symbols.

But there aren't many implementations. A good one is LPeg, for Lua. Here is an example of parsing arithmetic expressions:

-- Lexical Elements
local Space = lpeg.S(" \n\t")^0
local Number = lpeg.C(lpeg.P"-"^-1 * lpeg.R("09")^1) * Space
local FactorOp = lpeg.C(lpeg.S("+-")) * Space
local TermOp = lpeg.C(lpeg.S("*/")) * Space
local Open = "(" * Space
local Close = ")" * Space

-- Grammar
local Exp, Term, Factor = lpeg.V"Exp", lpeg.V"Term", lpeg.V"Factor"
G = lpeg.P{ Exp,
  Exp = lpeg.Ct(Factor * (FactorOp * Factor)^0);
  Factor = lpeg.Ct(Term * (TermOp * Term)^0);
  Term = Number + Open * Exp * Close;
}

G = Space * G * -1

It might be still a bit cryptic for the unprepared eye, but so are most new languages. Usage of English words helps the global understanding, though.

PhiLho
+1  A: 

I think the biggest problem (or anyway, the first problem) is that regexes are composed of the same characters they're supposed to match. You practically have to examine each character one-by-one to determine which ones are metacharacters. That up-front cost is, I believe, a big part of the reason why regex novices find them daunting and experts find them tedious. If they had their own set of dedicated symbols like math does, regexes would be a great deal easier to read, which would in turn make them easier to learn, debug, maintain, etc.

Alan Moore
A: 

SNOBOL contained a much more verbose pattern-matching syntax that was actually more powerful than regular expressions and somewhat easier to read. Unfortunately, it lost favor because of its verbosity. Regular expressions are valued as much because of their terseness as for their power. They get easier to read and construct the more you use them.

One factor which contributes to confusion in regular expressions is that not all features are implemented in the various languages that include them, and some use variations on the syntax. If you program in multiple languages, you have to be aware of these differences.

Ken Paul
A: 

Several people have noted that regular expressions must be cryptic because they are a DSL, expressing a lot in a small amount of space, and that they're less complicated than equivalent procedural code.

I buy all of that, but no one has mentioned history. The syntax could be a little less cryptic if it didn't have the burden of decades of backward compatibility. If someone were free to start from scratch today, they probably wouldn't have such awkward syntax for things like positive and negative look-ahead and look-behind.
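
For anyone who hasn't met them, these are the four lookaround forms in question, shown in Python with a made-up sample string. The syntax is the same in most modern flavours, though support for look-behind varies:

```python
import re

# (?=...)   positive look-ahead:  followed by ...
# (?!...)   negative look-ahead:  not followed by ...
# (?<=...)  positive look-behind: preceded by ...
# (?<!...)  negative look-behind: not preceded by ...

text = "price: $15, weight: 40kg, code: x72"

followed_by_kg = re.findall(r"\d+(?=kg)", text)
not_followed_by_kg = re.findall(r"\d+(?!kg|\d)", text)
after_dollar = re.findall(r"(?<=\$)\d+", text)
not_after_dollar = re.findall(r"(?<![$\d])\d+", text)
```

Note the extra \d inside the two negative forms: without it the engine would happily match part of a number, which is exactly the sort of subtlety the punctuation does nothing to flag.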

John D. Cook
A: 

High-level functionality is achieved through largely high-level syntax. This can be seen in various language constructs of "StartsWith", "EndsWith", "Contains", "Like", etc.

Low-level functionality is achieved through largely low-level syntax. Regular expressions, when thought of as a DSL, qualify as such.

joseph.ferris
A: 

Short answer: they are not.

Long answer: maybe because you are not used to them yet. It's like Neo: once you learn how to use its power, you'll begin to "see the Matrix". :)

Marc
A: 

Assuming your programming language does not have regex literals, e.g. you are using C#, Java, VB etc…

Part of the problem is that you have to quote them before using them in your source code, so a regex of

"\b"

Becomes

"\"\\b\""

As a DSL they do not embed well into the host language
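
For contrast, a language with raw string literals avoids most of the doubling. A sketch in Python (not one of the languages named above; it is used here only because it offers both styles side by side):

```python
import re

# Without raw strings, the backslash must be doubled for the host language,
# because "\b" on its own is a backspace character in a Python string.
escaped = re.compile("\\bcat\\b")

# A raw string passes the backslashes through to the regex engine untouched.
raw = re.compile(r"\bcat\b")

same_pattern = escaped.pattern == raw.pattern
words = raw.findall("cat concat cat.")
```

Both objects hold the identical pattern; the only difference is how much noise the source code carries.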

Ian Ringrose
No, that only applies to a language like Java that doesn't feature regex literals or raw/literal strings.
Alan Moore
Every mainstream statically typed language I know does not have regex literals. So the “only applies to” equals 90% of the languages that large scale systems are written in!
Ian Ringrose
But there are several widely-used languages that _do_ have regex literals (Perl, Ruby and JavaScript spring to mind), and your advice doesn't apply to them. You should at least qualify your advice to account for them.
Alan Moore
A: 

Regex is difficult to read, I believe, because it's meant to be used for short "throw away" scripts. The only alternative I've seen to regex is parser generators.

I actually made a parser generator, called LiPG (Lithium Parser Generator). It's still sort of "version 1" and needs some serious tweaks, but it works pretty well. For example:

parse digits
[in[int num]
    anychar[] X
    anyindex n
 -> X  X['0' <= X[n] <= '9'] cond[strlen(X) == num]
   [   return true;
   ]
]

parse somethingSimple
[   anychar[] X, Y
    anyindex n
 -> (digits[3] X digits[2] Y digits[4])
    X[X[n] == '-']   // every character of X should be a dash
    Y[Y[n] == '-']   // every character of Y should be a dash
    cond[ strlen(X)>0 && strlen(Y)>0 ]
    [    return true;
    ]
]

So that would be something similar to the "readable regex" that you linked to (at http://flimflan.com/blog/ReadableRegularExpressions.aspx). It's a lot more powerful than regex because you can call a "parse function" inside the definition of another parse function, just like you can with normal functions.

Here's the documentation for LiPG: http://fresheneesz.110mb.com/LiPG/LiPG%20Documentation.html

B T