On the one hand, there are many people who seem to see regular expressions as the holy grail. Something that looks so complicated just must be the answer to any question. They think that every problem is solvable using regular expressions.

On the other hand, there are also many people who try to avoid regular expressions at any cost. They try to find a way around regular expressions and accept additional coding just for the sake of it, even if a regular expression would be the easiest solution.

Why are regular expressions considered so controversial? Is there widespread misunderstanding about how they work? Or could it be a broad belief that regular expressions are generally slow?

+8  A: 

"Regular Expressions: Now You Have Two Problems" is a great article from Jeff Atwood on the matter. Basically, regular expressions are "hard"! They can create new problems. They are effective, however.

Tony k
+56  A: 

I don't think people object to regular expressions because they're slow, but rather because they're hard to read and write, as well as tricky to get right. While there are some situations where regular expressions provide an effective, compact solution to the problem, they are sometimes shoehorned into situations where it's better to use an easy-to-read, maintainable section of code instead.

Kyle Cronin
+40  A: 

Regexes are a great tool, but people think "Hey, what a great tool, I will use it to do X!" where X is something that a different tool is better for (usually a parser). It is the standard "using a hammer where you need a screwdriver" problem.

Chas. Owens
Just remember that most parsers -lexical analyzers- still use regular expressions to parse their stuff :-)
Jasper Bekkers
Saying that parsers use regular expressions is like saying parsers use assignment statements. It means nothing until you look to see how they are being used.
Chas. Owens
Using a RegEx when a parser is better is annoying. Using a RegEx when the language's standard string find or replace functions will work (and in linear time usually) is just unforgivable.
jmucchiello
Agreed, because a RegEx has to be a jack of all trades its processing overhead is huge. Just because using a RegEx engine seems easy doesn't mean it's a better solution than an iterative parser (developer dependent threshold). One of my favourite examples is PHP's `split($pattern,$string)` vs `explode($delimiter,$string)` - thankfully the former is getting deprecated, but lots of code used the former when it only needed the power of the latter. Agreed, RegExes provide an easy tool to do some things but unless you need the full power of regular expressions they
Rudu
+3  A: 

You may almost as well be asking why gotos are controversial.

Basically, when you get so much "obvious" power, people are apt to abuse them for situations they aren't the best option for. The number of people asking to parse CSVs or XML or HTML in regexes, for example, astounds me. It's the wrong tool for the job. But some users insist on using regexes anyway.

Personally, I try to find that happy medium - use regexes for what they're good for, and avoid them when they're less than optimal.

Note that regexes can still be used to parse CSVs, XML, HTML, etc. But usually not in a single regex.

Tanktalus
Sure you can parse any of these formats in a single regex, that's the power of regexes, baby! Whether or not you want to do that is a different matter entirely.
Jasper
+2  A: 

Regular expressions are a serious mystery to a lot of people, including myself. It works great but it's like looking at a math equation. I'm happy to report though that somebody has finally created a consolidated location of various regular expression functions at http://regexlib.com/. Now if Microsoft would only create a regular expression class that would automatically do much of the common stuff like eliminating letters, or filtering dates.

Al Katawazi
You're missing the point. The idea of regexes is that you invest some time in learning them and when you are done, you no longer need some magical "read a date" class. Instead, it takes very little effort to write a regex for them. Moreover, it will take just as little effort to write one for "yyyy/mm/dd" as it takes to write one for "mm-dd-yyyy", or even one for "mm-yyyy/dd" (which won't happen too often, but it's an example of how you can do things that a magical class never can).
Jasper
+23  A: 

People tend to think regular expressions are hard, but that's because they're using them wrong: writing complex one-liners without any comments, indenting or named captures. (You don't cram your complex SQL expression in one line, without comments, indenting or aliases, do you?) So yes, for a lot of people, they don't make sense.
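As a rough illustration of what comments, indenting and named captures buy you, here is a minimal Python sketch. The date format and the names `DATE_RE` and `parse_date` are made up for the example; it relies on Python's `re.VERBOSE` flag, which ignores whitespace in the pattern and allows `#` comments.

```python
import re

# Hypothetical pattern for dates like "2009-03-14"; illustration only.
DATE_RE = re.compile(
    r"""
    (?P<year>  \d{4} )   # four-digit year
    -
    (?P<month> \d{2} )   # two-digit month
    -
    (?P<day>   \d{2} )   # two-digit day
    """,
    re.VERBOSE,
)


def parse_date(text):
    """Return (year, month, day) as ints, or None if text is not a date."""
    match = DATE_RE.fullmatch(text)
    if match is None:
        return None
    return tuple(int(match.group(name)) for name in ("year", "month", "day"))
```

Note that this only checks the shape of the input; it would still accept a month of 99, so range checks belong in ordinary code after the match.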

However, if your job has anything to do with parsing text (roughly any web-application out there...) and you don't know regular expressions, you suck at your job and you're wasting your own time and that of your employer. There are excellent resources out there to teach you everything about them that you'll ever need to know, and more.

Jasper Bekkers
Well .. the difference is that multiple spaces have meaning in regex, where in other languages they don't and that's why they are usually one liners (that sometimes wrap to multiple lines :)
Rado
@Rado: In that case it's usually easier to make them explicit as [ ] or \s
Jasper Bekkers
@Rado: Perl, for instance, has the `x` modifier for regexes that causes whitespace to be ignored. This allows you to put the regex on a few lines and add comments.
Nathan Fellman
Likewise Python has `re.X` a.k.a. `re.VERBOSE`.
Craig McQueen
Likewise the `x` modifier in tcl. I believe it's quite standard since tcl, unlike other languages, does not use PCRE.
slebetman
+8  A: 

I don't think they're that controversial.

I also think you've sort of answered your own question, because you point out how silly it would be to use them everywhere (not everything is a regular language) or to avoid using them at all. You, the programmer, have to make an intelligent decision about when regular expressions will help the code or hurt it. When faced with such a decision, two important things to keep in mind are maintainability (which implies readability) and extensibility.

For those that are particularly averse to them, my guess is that they've never learned to use them properly. I think most people who spend just a few hours with a decent tutorial will figure them out and become fluent very quickly. Here's my suggestion for where to get started:

http://docs.python.org/howto/regex

Although that page talks about regular expressions in the context of Python, I've found the information is very applicable elsewhere. There are a few things that are Python-specific, but I believe they are clearly noted, and easy to remember.

allyourcode
I like the Python regex page. Thanks.
Mark Stock
The page seems to have moved to http://docs.python.org/howto/regex
DMan
@DMan Thanks. I'll edit my answer to reflect.
allyourcode
+2  A: 

The problem is that regexes are potentially so powerful that you can do things with them that you should use something different for.

A good programmer should know where to use them, and where not. The typical example is parsing non-regular languages (see Deciding whether a language is regular).

I think that you can't go wrong if you at first restrict yourself to real regular expressions (no extensions). Some extensions can make your life a bit easier, but if you find something hard to express as a real regex, this may well be an indication that a regex is not the right tool.
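A classic case of a non-regular language is arbitrarily deep balanced parentheses: a "real" regular expression cannot count nesting depth, while a few lines of ordinary code can. The following is only a sketch, and the function name `is_balanced` is invented for the example.

```python
def is_balanced(text):
    """Check that every '(' has a matching ')' using a simple depth counter."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0
```

The counter is exactly the piece of state that a finite-state machine lacks, which is why this job is awkward for a plain regex.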

Svante
+19  A: 

Regular expressions allow you to write a custom finite-state machine (FSM) in a compact way, to process a string of input. There are at least two reasons why using regular expressions is hard:

  • Old-school software development involves a lot of planning, paper models, and careful thought. Regular expressions fit into this model very well, because to write an effective expression properly involves a lot of staring at it, visualizing the paths of the FSM.

    Modern software developers would much rather hammer out code, and use a debugger to step through execution, to see if the code is correct. Regular expressions do not support this working style very well. One "run" of a regular expression is effectively an atomic operation. It's hard to observe stepwise execution in a debugger.

  • It's too easy to write a regular expression that accidentally accepts more input than you intend. The value of a regular expression isn't really to match valid input, it's to fail to match invalid input. Techniques to do "negative tests" for regular expressions are not very advanced, or at least not widely used.

    This goes to the point of regular expressions being hard to read. Just by looking at a regular expression, it takes a lot of concentration to visualize all possible inputs that should be rejected, but are mistakenly accepted. Ever try to debug someone else's regular expression code?
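One possible way to do such "negative tests" is to keep a list of inputs that must be rejected next to the pattern and check it mechanically. The sketch below uses Python; the pattern, the sample inputs and the helper names are all assumptions chosen for illustration.

```python
import re

# Hypothetical pattern meant to accept a non-negative integer such as "42".
INTEGER_RE = re.compile(r"[0-9]+")

SHOULD_MATCH = ["0", "42", "2795"]
SHOULD_NOT_MATCH = ["", "42abc", "abc42", "4 2", "-1", "4.2"]


def loose_accepts(text):
    """Unanchored at the end: accepts any text that merely starts with digits."""
    return INTEGER_RE.match(text) is not None


def strict_accepts(text):
    """Anchored at both ends: the whole text must be digits."""
    return INTEGER_RE.fullmatch(text) is not None


def wrongly_accepted(accepts):
    """Return the invalid inputs that the given checker fails to reject."""
    return [text for text in SHOULD_NOT_MATCH if accepts(text)]
```

Both checkers pass every entry in `SHOULD_MATCH`, so a test suite with only positive cases would not tell them apart; only the rejection list exposes the loose one.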

If there's a resistance to using regular expressions among software developers today, I think it's chiefly due to these two factors.

Bill Karwin
There are excellent tools out there to debug regexps: http://www.regexbuddy.com/
Jasper Bekkers
perl -Mre=debug -e "q[aabbcc]=~/ab*[cd]/"
Brad Gilbert
+13  A: 

Because they lack the most popular learning tool in the commonly accepted IDEs: There's no Regex Wizard. Not even Autocompletion. You have to code the whole thing all by yourself.

le dorfier
Funny, but contains some sad truth.
Svante
Then you're using the wrong IDE... Even my text editor provides regex hints.
CurtainDog
The point is that some can't manage very well without it. But what editor are you referring to? And how does it relate to IDE features?
le dorfier
On a side note, Expresso and The Regex Coach are very useful tools for constructing regular expressions.
Mun
How in the world would you autocomplete a regular expression?
AmbroseChapel
Autocompletes could bring up character sets, greedy vs possessive vs non-greedy matches, look ahead and look behind, also bracket matching, etc. Regexes are succinct but there is still some room for help from the editor.
CurtainDog
EditPad Pro has syntax highlighting for regexes in the search box, but I find it more annoying than helpful, and keep it turned off. But I do appreciate it letting me know when I have unmatched brackets; parentheses in particular can be a bear to keep track of.
Alan Moore
Use Expresso! Regexes don't need a wizard; they're easy to write.
wonea
A: 

Get RegexBuddy. Then you'll be flinging regular expressions around like a professional and as a !!bonus!! you start understanding them!

Scott Evernden
Ahem... So you are promoting using something you don't understand?
Eduardo León
@Eduardo León: no, he's not. As far as I can tell, he has not said that he does not understand them.
nico
A: 

While I think regexes are an essential tool, the most annoying thing about them is that there are different implementations. Slight differences in syntax, modifiers, and -especially- "greed" can make things really chaotic, requiring trial-and-error and sometimes generating puzzling bugs.
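To make the "greed" issue concrete, here is a small sketch in Python's `re` module. It only shows how greedy and non-greedy quantifiers differ within one engine; how other implementations spell or default these quantifiers is exactly the kind of thing that varies, so treat the syntax as Python-specific. The sample text is made up.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: ".+" grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.+>", text)

# Non-greedy: ".+?" stops at the first ">" it can, giving one match per tag.
lazy = re.findall(r"<.+?>", text)
```

Here `greedy` holds a single match covering the entire text, while `lazy` holds the four individual tags.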

ndr
+20  A: 

Almost everyone I know who uses regular expressions regularly (pun intended) comes from a Unix-ish background where they use tools that treat REs as first-class programming constructs, such as grep, sed, awk, and Perl. Since there's almost no syntactic overhead to use a regular expression, their productivity goes way up when they do.

In contrast, programmers who use languages in which REs are an external library tend not to consider what regular expressions can bring to the table. The programmer "time-cost" is so high that either a) REs never appeared as part of their training, or b) they don't "think" in terms of REs and prefer to fall back on more familiar patterns.

Barry Brown
Good and interesting point. I'd never thought of that aspect.
Dave Sherohman
Yeah, I never forgave Python for making the regex syntax verbose by using a library. I think it's purity over sanity.
Reinis I.
A: 

I find regular expressions invaluable at times: when I need to do some "fuzzy" searches, and maybe replaces, or when data may vary and have a certain randomness. However, when I need to do a simple search and replace, or check for a string, I do not use regular expressions. I know many people who do, though; they use them for everything. That is the controversy.

If you want to put a tack in the wall, don't use a hammer. Yes, it will work, but by the time you get the hammer, I could put 20 tacks in the wall.

Regular expressions should be used for what they were designed for, and nothing less.

Brent Baisley
A: 

The best valid and normal usage for regex is for email address format validation.

That's a good application of it.

I have used regular expressions countless times as one-offs in TextPad to massage flat files, create csv files, create SQL insert statements and that sort of thing.

Well-written regular expressions shouldn't be too slow. Usually the alternatives, like tons of calls to Replace, are far slower options. Might as well do it in one pass.

Many situations call for exactly regular expressions and nothing else.

Replacing special non-printing characters with innocuous characters is another good usage.

I can of course imagine that there are some codebases that overuse regular expressions to the detriment of maintainability. I have never seen that myself. I have actually been chewed out by code reviewers for not using regular expressions enough.

Christopher Morley
Experience shows that regexes are actually a pretty poor tool for email address format validation. A truly complete format validator implemented as a regex is a multi-hundred-character monstrosity, while most of the shorter "good enough" validators that most people take 5 minutes to create will reject large categories of valid, deliverable addresses.
Dave Sherohman
I hear ya dude. I was talking about the "good enough" and while the large swaths may be large in theory, consider the percentage of coverage you get in such a short expression. I too have seen the monstrosity, but what is your elegant alternative?
Christopher Morley
I've used something like \w@\w+.\w+ to find email addresses quickly in a huge directory of files where speed was important and a few false positives or false negatives weren't important. But the best way to validate an email address seems to be to send email to it.
RossFabricant
Yeah, the email address spec is a nasty mess: http://stackoverflow.com/questions/611775/regular-expression-for-valid-email-address-closed
Nick
+2  A: 

I don't think "controversial" is the right word.

But I've seen tons of examples where people say "what's the regular expression I need to do such-and-such a string manipulation?" which are X-Y problems.

In other words, they've started from the assumption that a regex is what they need, but they'd be better off with a split(), a translation like perl's tr/// where characters are substituted one for the other, or just an index().
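For comparison, here is roughly what those three alternatives look like in Python, as a sketch with made-up sample data: `str.split` for splitting, `str.translate` as an analogue of Perl's tr///, and `str.find` as an analogue of index().

```python
# Splitting on a fixed delimiter: no regex needed.
line = "alice,bob,carol"
fields = line.split(",")

# Character-for-character substitution, similar in spirit to tr/ACGT/TGCA/.
dna = "GATTACA"
complement = dna.translate(str.maketrans("ACGT", "TGCA"))

# Locating a fixed substring; find() returns -1 when it is absent.
position = "hello world".find("world")
```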

AmbroseChapel
+4  A: 

Regular expressions are to strings what arithmetic operators are to numbers, and I wouldn't consider them controversial. I think that even a fairly militant OO activist like myself (who would tend to choose other objects over strings) would be hard pressed to reject them.

CurtainDog
+1  A: 

In some cases I think you HAVE to use them. For instance to build a lexer.

In my opinion, this comes down to a difference in point of view between people who can write regexps and people who can't (or hardly can). I personally think they are a good thing, for example to validate the input of a form, be it in JavaScript to warn the user or in a server-side language.

Aif
A: 

I think it is a lesser known technique among programmers. So, there is not a wide acceptance for it. And if you have a non-technical manager to review your code or review your work then a regular expression is very bad. You will spend hours writing a perfect regular expression, and you will get few marks for the module because the manager thinks you have written so few lines of code. Also, as said elsewhere, reading regular expressions is a very difficult task.

Satya Prakash
+1  A: 

This is an interesting subject.
Many regexp aficionados seem to confuse the conciseness of the formula with efficiency.
On top of that, a regexp that requires a lot of thought gives its author a massive satisfaction that makes it seem legitimate straight away.

But... regexps are so convenient when performance is not an issue and you need to deal quickly with a text output, in Perl for instance. Also, when performance is an issue one may prefer not to try to beat the regexp library by using a homemade algorithm that may be buggy or less efficient.

Besides there are a number of reasons for which regexps are unfairly criticized, for instance

  • the regexp is not efficient, because building the best one is not obvious
  • some programmers "forget" to compile only once a regexp to be used many times (like a static Pattern in Java)
  • some programmers go for the trial and error strategy - works even less with regexps!
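
The "compile only once" point can be sketched in Python as follows; the pattern and the names `WORD_RE` and `count_words` are invented for the example. Python's `re` module does keep an internal cache of recently compiled patterns, so the saving is usually smaller than with a static Pattern in Java, but a module-level compiled pattern makes the intent explicit.

```python
import re

# Compiled once at import time, analogous to a static Pattern in Java.
WORD_RE = re.compile(r"\w+")


def count_words(lines):
    """Count words across many lines, reusing the precompiled pattern."""
    return sum(len(WORD_RE.findall(line)) for line in lines)
```
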
ring0
A: 

Making Regexes Maintainable

A major advance toward demystifying the patterns previously referred to as “regular expressions” is Perl’s /x regex flag, sometimes written (?x) when embedded, which allows whitespace (line breaking, indenting) and comments. This seriously improves readability and therefore maintainability. The white space allows for cognitive chunking, so you can see what groups with what.

Modern patterns also now support both relatively numbered and named backreferences. That means you no longer need to count capture groups to figure out that you need $4 or \7. This helps when creating patterns that can be included in further patterns.

Here is an example of a relatively numbered capture group:

$dupword = qr{ \b (?: ( \w+ ) (?: \s+ \g{-1} )+ ) \b }xi;
$quoted  = qr{ ( ["'] ) $dupword  \1 }x;

And here is an example of the superior approach of named captures:

$dupword = qr{ \b (?: (?<word> \w+ ) (?: \s+ \k<word> )+ ) \b }xi;
$quoted  = qr{ (?<quote> ["'] ) $dupword  \g{quote} }x;
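
For readers outside Perl, roughly the same named-capture idea can be sketched in Python, which spells the group `(?P<word>...)` and the backreference `(?P=word)`. This is only an approximation of the Perl `$dupword` pattern above, and the names `DUPWORD_RE` and `find_duplicates` are made up for the example.

```python
import re

DUPWORD_RE = re.compile(
    r"""
    \b
    (?P<word> \w+ )          # a word, captured by name
    (?: \s+ (?P=word) )+     # followed by one or more repeats of that same word
    \b
    """,
    re.VERBOSE | re.IGNORECASE,
)


def find_duplicates(text):
    """Return each word that is immediately repeated in text."""
    return [m.group("word") for m in DUPWORD_RE.finditer(text)]
```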

Grammatical Regexes

Best of all, these named captures can be placed within a (?(DEFINE)...) block, so that you can separate out the declaration from the execution of individual named elements of your patterns. This makes them act rather like subroutines within the pattern.
A good example of this sort of “grammatical regex” can be found in this answer and this one. These look much more like a grammatical declaration.

As the latter reminds you:

… make sure never to write line‐noise patterns. You don’t have to, and you shouldn’t. No programming language can be maintainable that forbids white space, comments, subroutines, or alphanumeric identifiers. So use all those things in your patterns.

This cannot be over-emphasized. Of course if you don’t use those things in your patterns, you will often create a nightmare. But if you do use them, you need not.

Here’s another example of a modern grammatical pattern, this one for parsing RFC 5322:

use 5.10.0;

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;

Isn't that remarkable — and splendid? You can take a BNF-style grammar and translate it directly into code without losing its fundamental structure!

If modern grammatical patterns still aren’t enough for you, then Damian Conway’s brilliant Regexp::Grammars module offers an even cleaner syntax, with superior debugging, too. Here’s the same code for parsing RFC 5322 recast into a pattern from that module:

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;
use Data::Dumper "Dumper";

my $rfc5322 = do {
    use Regexp::Grammars;    # ...the magic is lexically scoped
    qr{

    # Keep the big stick handy, just in case...
    # <debug:on>

    # Match this...
    <address>

    # As defined by these...
    <token: address>         <mailbox> | <group>
    <token: mailbox>         <name_addr> | <addr_spec>
    <token: name_addr>       <display_name>? <angle_addr>
    <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
    <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
    <token: display_name>    <phrase>
    <token: mailbox_list>    <[mailbox]> ** (,)

    <token: addr_spec>       <local_part> \@ <domain>
    <token: local_part>      <dot_atom> | <quoted_string>
    <token: domain>          <dot_atom> | <domain_literal>
    <token: domain_literal>  <CFWS>? \[ (?: <FWS>? <[dcontent]>)* <FWS>?
                             \] <CFWS>?

    <token: dcontent>        <dtext> | <quoted_pair>
    <token: dtext>           <.NO_WS_CTL> | [\x21-\x5a\x5e-\x7e]

    <token: atext>           <.ALPHA> | <.DIGIT> | [!#\$%&'*+\-/=?^_`{|}~]
    <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
    <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
    <token: dot_atom_text>   <.atext>+ (?: \. <.atext>+)*

    <token: text>            [\x01-\x09\x0b\x0c\x0e-\x7f]
    <token: quoted_pair>     \\ <.text>

    <token: qtext>           <.NO_WS_CTL> | [\x21\x23-\x5b\x5d-\x7e]
    <token: qcontent>        <.qtext> | <.quoted_pair>
    <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                             <.FWS>? <.DQUOTE> <.CFWS>?

    <token: word>            <.atom> | <.quoted_string>
    <token: phrase>          <.word>+

    # Folding white space
    <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
    <token: ctext>           <.NO_WS_CTL> | [\x21-\x27\x2a-\x5b\x5d-\x7e]
    <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
    <token: comment>         \( (?: <.FWS>? <.ccontent>)* <.FWS>? \)
    <token: CFWS>            (?: <.FWS>? <.comment>)*
                             (?: (?:<.FWS>? <.comment>) | <.FWS>)

    # No whitespace control
    <token: NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f]

    <token: ALPHA>           [A-Za-z]
    <token: DIGIT>           [0-9]
    <token: CRLF>            \x0d \x0a
    <token: DQUOTE>          "
    <token: WSP>             [\x20\x09]

    }x;

};


while (my $input = <>) {
    if ($input =~ $rfc5322) {
        say Dumper \%/;       # ...the parse tree of any successful match
                              # appears in this punctuation variable
    }
}

There’s a lot of good stuff in the perlre manpage, but these dramatic improvements in fundamental regex design features are by no means limited to Perl alone. Indeed the pcrepattern manpage may be an easier read, and covers the same territory.

Modern patterns have almost nothing in common with the primitive things you were taught in your finite automata class.

Joel