views:

1260

answers:

13

I have started to feel that using regular expressions decreases code maintainability. There is something evil about the terseness and power of regular expressions. Perl compounds this with side effects like default operators.

I DO have a habit of documenting regular expressions with at least one sentence giving the basic intent and at least one example of what would match.

Because regular expressions are built up I feel it is an absolute necessity to comment on the largest components of each element in the expression. Despite this even my own regular expressions have me scratching my head as though I am reading Klingon.

Do you intentionally dumb down your regular expressions? Do you decompose possibly shorter and more powerful ones into simpler steps? I have given up on nesting regular expressions. Are there regular expression constructs that you avoid due to maintainability issues?

Do not let this example cloud the question.

If the following by Michael Ash had some sort of bug in it would you have any prospects of doing anything but throwing it away entirely?

^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$

Per request the exact purpose can be found using Mr. Ash's link above.

Matches 01.1.02 | 11-30-2001 | 2/29/2000

Non-Matches 02/29/01 | 13/01/2002 | 11/00/02

A: 

I do not expect regular expressions to be readable, so I just leave them as they are, and rewrite if needed.

M. Utku ALTINKAYA
Don't you think you introduce bugs with this habit?
ojblass
It could pose a risk in some scenarios but not in others. For example I would feel safe doing this myself if complete unit tests were in place covering each and every use case for the regular expression. Without those tests, it would be scary for sure!
Adam Alexander
Not at all. Changing them is more risky than rewriting them in most cases, because you have to understand the whole scope. Unit tests are a must whenever you use complex regular expressions.
M. Utku ALTINKAYA
+1  A: 

I could still work with it. I'd just use Regulator. One thing it allows you to do is save the regex along with test data for it.

Of course, I might also add comments.
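Keeping a regex together with its test data does not need a dedicated tool, either. Here is a minimal Python sketch using the sample matches and non-matches from the question. The names `DATE_RE`, `SHOULD_MATCH`, `SHOULD_NOT_MATCH`, and `check_samples` are made up for illustration, and the pattern is only split across three string literals at its top-level alternatives:

```python
import re

# The pattern from the question, split at its three top-level alternatives.
DATE_RE = re.compile(
    r"^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
    r"|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$"
    r"|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$"
)

# Test data stored right next to the pattern (taken from the question).
SHOULD_MATCH = ["01.1.02", "11-30-2001", "2/29/2000"]
SHOULD_NOT_MATCH = ["02/29/01", "13/01/2002", "11/00/02"]


def check_samples():
    """Return (sample, expected, actual) for every sample that misbehaves."""
    failures = []
    for sample in SHOULD_MATCH:
        if DATE_RE.match(sample) is None:
            failures.append((sample, True, False))
    for sample in SHOULD_NOT_MATCH:
        if DATE_RE.match(sample) is not None:
            failures.append((sample, False, True))
    return failures
```

An empty list from `check_samples()` means the pattern still agrees with the saved data. Any later edit to the pattern can be checked the same way.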


Here's what Expresso produced. I had never used it before, but now, Regulator is out of a job:

//  using System.Text.RegularExpressions;

/// 
///  Regular expression built for C# on: Thu, Apr 2, 2009, 12:51:56 AM
///  Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///  
///  A description of the regular expression:
///  
///  Select from 3 alternatives
///      ^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$
///          Beginning of line or string
///          Match expression but don't capture it. [(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)]
///              Select from 2 alternatives
///                  (?:(?:0?[13578]|1[02])(\/|-|\.)31)\1
///                      Match expression but don't capture it. [(?:0?[13578]|1[02])(\/|-|\.)31]
///                          (?:0?[13578]|1[02])(\/|-|\.)31
///                              Match expression but don't capture it. [0?[13578]|1[02]]
///                                  Select from 2 alternatives
///                                      0?[13578]
///                                          0, zero or one repetitions
///                                          Any character in this class: [13578]
///                                      1[02]
///                                          1
///                                          Any character in this class: [02]
///                              [1]: A numbered capture group. [\/|-|\.]
///                                  Select from 3 alternatives
///                                      Literal /
///                                      -
///                                      Literal .
///                              31
///                      Backreference to capture number: 1
///                  (?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2)
///                      Return
///                      New line
///                      Match expression but don't capture it. [(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2]
///                          (?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2
///                              Match expression but don't capture it. [0?[13-9]|1[0-2]]
///                                  Select from 2 alternatives
///                                      0?[13-9]
///                                          0, zero or one repetitions
///                                          Any character in this class: [13-9]
///                                      1[0-2]
///                                          1
///                                          Any character in this class: [0-2]
///                              [2]: A numbered capture group. [\/|-|\.]
///                                  Select from 3 alternatives
///                                      Literal /
///                                      -
///                                      Literal .
///                              Match expression but don't capture it. [29|30]
///                                  Select from 2 alternatives
///                                      29
///                                          29
///                                      30
///                                          30
///                              Backreference to capture number: 2
///          Return
///          New line
///          Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
///              (?:1[6-9]|[2-9]\d)?\d{2}
///                  Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
///                      Select from 2 alternatives
///                          1[6-9]
///                              1
///                              Any character in this class: [6-9]
///                          [2-9]\d
///                              Any character in this class: [2-9]
///                              Any digit
///                  Any digit, exactly 2 repetitions
///          End of line or string
///      ^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$
///          Beginning of line or string
///          Match expression but don't capture it. [0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))]
///              0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)))
///                  0, zero or one repetitions2
///                  [3]: A numbered capture group. [\/|-|\.]
///                      Select from 3 alternatives
///                          Literal /
///                          -
///                          Literal .
///                  29
///                  Backreference to capture number: 3
///                  Match expression but don't capture it. [(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))]
///                      Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00)]
///                          Select from 2 alternatives
///                              (?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])
///                                  Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
///                                      Select from 2 alternatives
///                                          1[6-9]
///                                              1
///                                              Any character in this class: [6-9]
///                                          [2-9]\d
///                                              Any character in this class: [2-9]
///                                              Any digit
///                                  Match expression but don't capture it. [0[48]|[2468][048]|[13579][26]]
///                                      Select from 3 alternatives
///                                          0[48]
///                                              0
///                                              Any character in this class: [48]
///                                          [2468][048]
///                                              Any character in this class: [2468]
///                                              Any character in this class: [048]
///                                          [13579][26]
///                                              Any character in this class: [13579]
///                                              Any character in this class: [26]
///                              (?:(?:16|[2468][048]|[3579][26])00)
///                                  Return
///                                  New line
///                                  Match expression but don't capture it. [(?:16|[2468][048]|[3579][26])00]
///                                      (?:16|[2468][048]|[3579][26])00
///                                          Match expression but don't capture it. [16|[2468][048]|[3579][26]]
///                                              Select from 3 alternatives
///                                                  16
///                                                      16
///                                                  [2468][048]
///                                                      Any character in this class: [2468]
///                                                      Any character in this class: [048]
///                                                  [3579][26]
///                                                      Any character in this class: [3579]
///                                                      Any character in this class: [26]
///                                          00
///          End of line or string
///      ^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$
///          Beginning of line or string
///          Match expression but don't capture it. [(?:0?[1-9])|(?:1[0-2])]
///              Select from 2 alternatives
///                  Match expression but don't capture it. [0?[1-9]]
///                      0?[1-9]
///                          0, zero or one repetitions
///                          Any character in this class: [1-9]
///                  Match expression but don't capture it. [1[0-2]]
///                      1[0-2]
///                          1
///                          Any character in this class: [0-2]
///          Return
///          New line
///          [4]: A numbered capture group. [\/|-|\.]
///              Select from 3 alternatives
///                  Literal /
///                  -
///                  Literal .
///          Match expression but don't capture it. [0?[1-9]|1\d|2[0-8]]
///              Select from 3 alternatives
///                  0?[1-9]
///                      0, zero or one repetitions
///                      Any character in this class: [1-9]
///                  1\d
///                      1
///                      Any digit
///                  2[0-8]
///                      2
///                      Any character in this class: [0-8]
///          Backreference to capture number: 4
///          Match expression but don't capture it. [(?:1[6-9]|[2-9]\d)?\d{2}]
///              (?:1[6-9]|[2-9]\d)?\d{2}
///                  Match expression but don't capture it. [1[6-9]|[2-9]\d], zero or one repetitions
///                      Select from 2 alternatives
///                          1[6-9]
///                              1
///                              Any character in this class: [6-9]
///                          [2-9]\d
///                              Any character in this class: [2-9]
///                              Any digit
///                  Any digit, exactly 2 repetitions
///          End of line or string
///  
///
/// 
public static Regex regex = new Regex(
      "^(?:(?:(?:0?[13578]|1[02])(\\/|-|\\.)31)\\1|\r\n(?:(?:0?[13-9]"+
      "|1[0-2])(\\/|-|\\.)(?:29|30)\\2))\r\n(?:(?:1[6-9]|[2-9]\\d)?\\d"+
      "{2})$|^(?:0?2(\\/|-|\\.)29\\3(?:(?:(?:1[6-9]|[2-9]\\d)?(?:0["+
      "48]|[2468][048]|[13579][26])|\r\n(?:(?:16|[2468][048]|[3579][2"+
      "6])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))\r\n(\\/|-|\\.)(?:0?[1-9"+
      "]|1\\d|2[0-8])\\4(?:(?:1[6-9]|[2-9]\\d)?\\d{2})$",
    RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace // so the \r\n breaks embedded in the pattern are ignored
    | RegexOptions.Compiled
    );

John Saunders
It would be interesting to see what this tool comes up with for the example RE in the question. I was going to try but I couldn't be bothered downloading and installing. Do you want to give it a shot?
paxdiablo
I tried real quick. It doesn't like it much, at least not in the analyzer. It parsed ok. What's it meant to match?
John Saunders
Exactly, you have a bug in it... what do you do?!?!?!
ojblass
You wait until the Morning, then start cutting pieces out. Regulator, at least highlights matching parentheses.
John Saunders
Follow the name link in the question for an explanation of what it does.
ojblass
Now if only we could get Expresso to output "Hmm. I think this is checking for valid US-style dates between the years 1600 and 9999". Now that's a product I would buy :-)
paxdiablo
I would honestly not be able to handle the output from this tool with a straight face and a tight deadline.
ojblass
+1  A: 

I think the answer to maintaining regular expressions is not so much in commenting or in particular regex constructs.

If I were tasked with debugging the example you gave, I would sit down in front of a regex debug tool (like Regex Coach) and step through the regular expression on the data that it has to process.

Cannonade
It would be interesting to see what this tool comes up with for the example RE in the question. I was going to try but I couldn't be bothered downloading and installing. Do you want to give it a shot?
paxdiablo
Still so much evil all in one line of code... I am sure having this in my code is simply not worth it.
ojblass
I'll give it a shot Pax (nice answer by the way, I prefer yours to mine ;) ). I guess I will need some test data though. ojblass, can you post some in the question?
Cannonade
posted per your request
ojblass
Ok stepped through this guy in Regex Coach on the test data ojblass posted (thanks). I found stepping through it helped my understanding a great deal. That said, I would almost certainly never use this thing to validate dates in live code. Faster and clearer to write specialised code to do it.
Cannonade
+6  A: 

Some people use REs for the wrong things (I'm waiting for the first SO question on how to detect a valid C++ program using a single RE).

I usually find that, if I can't fit my RE within 60 characters, it's better off being a piece of code since that will almost always be more readable.

In any case, I always document, in the code, what the RE is supposed to achieve, in great detail. This is because I know, from bitter experience, how hard it is for someone else (or even me, six months later) to come in and try to understand.

I don't believe they're evil, although I do believe some people who use them are evil (not looking at you, Michael Ash :-). They're a great tool but, like a chainsaw, you'll cut your legs off if you don't know how to use them properly.

UPDATE: Actually, I've just followed the link to that monstrosity, and it's to validate m/d/y format dates between the years 1600 and 9999. That is a classic case of where full-blown code would be more readable and maintainable.

You just split it up into three fields and check the individual values. I'd almost consider it an offense worthy of termination if one of my minions brought this to me. I'd certainly send them back to write it properly.
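To make the "three fields" idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a drop-in replacement for the regex: the function name `is_valid_mdy` is invented, only four-digit years from 1600 to 9999 are accepted (two-digit years are left out because their century is ambiguous), and it is looser than the regex about leading zeros.

```python
import calendar


def is_valid_mdy(text: str) -> bool:
    """Check an m/d/y date that uses '/', '-' or '.' consistently as separator."""
    # Find a separator that splits the text into exactly three fields.
    for sep in ("/", "-", "."):
        fields = text.split(sep)
        if len(fields) == 3:
            break
    else:
        return False

    # Every field must consist of plain ASCII digits only.
    if not all(f.isascii() and f.isdigit() for f in fields):
        return False

    month, day, year = (int(f) for f in fields)
    if not 1600 <= year <= 9999:
        return False
    if not 1 <= month <= 12:
        return False

    # monthrange() returns (weekday of the 1st, number of days in the month),
    # so leap years are handled by the standard library.
    return 1 <= day <= calendar.monthrange(year, month)[1]
```

Each rule sits on its own line, so a bug in (say) the year range can be fixed without touching the rest.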

paxdiablo
Even then, most platforms provide functions which can convert a date for you. Really, use those!
strager
Please do not let the particular example cloud the question.
ojblass
I far prefer a morass of string manipulation functions to a regex. I can learn what the morass does in the debugger. A regex is just a black box.
jmucchiello
"How to detect a valid C++ program" - is that technically possible?
Justice
@Justice - With classic regular expressions, no. C++ is not a regular language, and (classic) regular expressions cannot properly parse nested parentheses and such. However, with all the PCRE extensions/hacks, it MIGHT be possible, though perhaps not a good idea.
Chris Lutz
+14  A: 

I usually just try to wrap all my regular expression calls inside their own function, with a meaningful name and some basic comments. I like to think of regular expressions as a write-only language, readable only by the one who wrote them (unless they're really simple). I fully expect that someone would probably need to completely rewrite the expression if they had to change its intent, and this is probably for the better, since it keeps the regular expression training alive.
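A minimal sketch of that wrapping in Python. The ZIP code check and the names `_US_ZIP_RE` and `looks_like_us_zip` are hypothetical stand-ins, picked only because the pattern is short:

```python
import re

# Kept private to this module; callers only see the named function below.
# re.ASCII restricts \d to the ASCII digits 0-9.
_US_ZIP_RE = re.compile(r"\d{5}(?:-\d{4})?", re.ASCII)


def looks_like_us_zip(text: str) -> bool:
    """True for 'NNNNN' or 'NNNNN-NNNN', e.g. '90210' or '90210-1234'."""
    # fullmatch() requires the whole string to match, so no ^...$ is needed.
    return _US_ZIP_RE.fullmatch(text) is not None
```

Callers never see the pattern, so it can be rewritten, or replaced by non-regex code, without touching them.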

James
Encapsulate what it does with functional meaning. That is one good practice I have not employed.
ojblass
Yeah, it's worked out quite well for many of the larger projects I've been involved with.
James
Actually, that is a good approach. Properly named and documented, you could then drop in a non-RE option if the RE one becomes unmaintainable. I don't agree that they should be write only, but +1 anyway.
paxdiablo
Good tools and inline comments have to take second place to a good practice when selecting the correct answer.
ojblass
+22  A: 

Use Expresso, which gives a hierarchical, English breakdown of a regex.

Or

This tip from Darren Neimke:

.NET allows regular expression patterns to be authored with embedded comments via the RegexOptions.IgnorePatternWhitespace compiler option and the (?#...) syntax embedded within each line of the pattern string.

This allows for pseudo-code-like comments to be embedded in each line and has the following effect on readability:

Dim re As New Regex ( _
    "(?<=       (?# Start a positive lookBEHIND assertion ) " & _
    "(\#|@)     (?# Find a # or a @ symbol; the # is escaped because a bare # starts a comment in this mode ) " & _
    ")          (?# End the lookBEHIND assertion ) " & _
    "(?=        (?# Start a positive lookAHEAD assertion ) " & _
    "   \w+     (?# Find at least one word character ) " & _
    ")          (?# End the lookAHEAD assertion ) " & _
    "\w+\b      (?# Match multiple word characters leading up to a word boundary)", _
    RegexOptions.Multiline Or RegexOptions.IgnoreCase Or RegexOptions.IgnorePatternWhitespace _
)

Here's another .NET example (requires the RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace options):

static string validEmail = @"\b    # Find a word boundary
                (?<Username>       # Begin group: Username
                [a-zA-Z0-9._%+-]+  #   Characters allowed in username, 1 or more
                )                  # End group: Username
                @                  # The e-mail '@' character
                (?<Domainname>     # Begin group: Domain name
                [a-zA-Z0-9.-]+     #   Domain name(s), we include a dot so that
                                   #   mail.somewhere is also possible
                \.[a-zA-Z]{2,4}    #   A literal dot, then a top level domain of 2 to 4 letters
                                   #   So .info works, .telephone doesn't.
                )                  # End group: Domain name
                \b                 # Ending on a word boundary
                ";

If your RegEx is applicable to a common problem, another option is to document it and submit to RegExLib, where it will be rated and commented upon. Nothing beats many pairs of eyes...

Another RegEx tool is The Regulator

Mitch Wheat
inline commenting... good tip
ojblass
It would be interesting to see what this tool comes up with for the example RE in the question. I was going to try but I couldn't be bothered downloading and installing. Do you want to give it a shot?
paxdiablo
Takes seconds to download and install...
Mitch Wheat
My apathy far outweighs my interest :-)
paxdiablo
Expresso did very nicely with this. Thanks! I hadn't known about it, and it looks better than Regulator.
John Saunders
Very nice... good tools to debug them... I have been living in the dark ages for a while.
ojblass
(?# ... ) comments are also in Perl, and therefore probably also in the PCRE engines used by most languages.
Chris Lutz
+15  A: 

Well, the entire purpose in life of the PCRE /x modifier is to allow you to write regexes more readably, as in this trivial example:

my $expr = qr/
    [a-z]    # match a lower-case letter
    \d{3,5}  # followed by 3-5 digits
/x;
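The same idea exists outside Perl. As a rough equivalent, Python's `re.VERBOSE` flag ignores unescaped whitespace in the pattern and allows `#` comments. The name `EXPR` here is only for illustration:

```python
import re

EXPR = re.compile(
    r"""
    [a-z]    # match a lower-case letter
    \d{3,5}  # followed by 3-5 digits
    """,
    re.VERBOSE,
)
```

As in Perl, a literal space inside such a pattern has to be written explicitly, for example as `\s`, `[ ]`, or an escaped space.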
chaos
I am having trouble finding documentation on this... maybe I am not using the right terms or right source. Can you help?
ojblass
http://perldoc.perl.org/perlre.html, about a page down, paragraph starting with "The /x modifier itself needs a little more explanation". There are also examples scattered through the rest of the page.
chaos
(That's aka 'man perlre' of course.)
chaos
In Perl this would be the suggested way to go about dealing with legibility concerns. Specifically, the /x modifier tells Perl to ignore all whitespace (the regex must use \s to specify spaces in the search) and lets the '#' character start a normal comment.
Danny
Also works with PHP PCRE functions
gnarf
+3  A: 

I have learned to avoid all but the simplest regexp. I far prefer other models such as Icon's string scanning or Haskell's parsing combinators. In both of these models you can write user-defined code that has the same privileges and status as the built-in string ops. If I were programming in Perl I would probably rig up some parsing combinators in Perl---I've done it for other languages.

A very nice alternative is to use Parsing Expression Grammars as Roberto Ierusalimschy has done with his LPEG package, but unlike parser combinators this is something you can't whip up in an afternoon. But if somebody has already done PEGs for your platform it's a very nice alternative to regular expressions.
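For readers who have not met parsing combinators, here is a deliberately tiny Python sketch of the idea. It is not Icon, Haskell, or LPEG code, and every name in it (`char_in`, `seq`, `many1`, `mapped`, `parse_all`) is invented for illustration. A parser is just a function taking `(text, pos)` and returning `(value, new_pos)` or `None`, and combinators build bigger parsers out of smaller ones:

```python
def char_in(chars):
    """Parser that consumes one character if it is in `chars`."""
    def parse(text, pos):
        if pos < len(text) and text[pos] in chars:
            return text[pos], pos + 1
        return None
    return parse


def seq(*parsers):
    """Run parsers one after another; fail if any of them fails."""
    def parse(text, pos):
        values = []
        for parser in parsers:
            result = parser(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse


def many1(parser):
    """Apply `parser` one or more times and collect the values."""
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None:
                break
            value, pos = result
            values.append(value)
        return (values, pos) if values else None
    return parse


def mapped(parser, fn):
    """Transform the value produced by `parser` with ordinary code."""
    def parse(text, pos):
        result = parser(text, pos)
        if result is None:
            return None
        value, pos = result
        return fn(value), pos
    return parse


digit = char_in("0123456789")
number = mapped(many1(digit), lambda digits: int("".join(digits)))
separator = char_in("/-.")
mdy = seq(number, separator, number, separator, number)


def parse_all(parser, text):
    """Return the parsed value only if the parser consumes the whole text."""
    result = parser(text, 0)
    if result is None or result[1] != len(text):
        return None
    return result[0]
```

With these definitions, `parse_all(mdy, "2/29/2000")` yields `[2, "/", 29, "/", 2000]`. User-written pieces such as `number` combine exactly like the primitive ones, which is the property described above. This sketch does not check that the two separators agree.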

Norman Ramsey
I was composing an answer when yours came in; I threw mine out and upvoted yours instead.
MarkusQ
Every time I parse anything but the simplest strings I make mistakes.
ojblass
@ojblass: This is why John Levine hates handwritten parsers.
Norman Ramsey
Maybe programmers should examine the difficulty of parsing using the Chomsky hierarchy and choose an appropriate parser from there! Really, it is irritating to see people use back-references (which are context-sensitive) to compensate for things they could easily implement in a context-free grammar.
Gracenotes
+4  A: 

I have found a nice method is to simply break up the matching process into several phases. It probably does not execute as fast but you have the added bonus of also being able to tell at a finer grain level why the match is not occurring.
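A small Python sketch of the phased approach, using a hypothetical dotted-quad check as the example (the names `_SHAPE_RE` and `check_dotted_quad` are invented). Phase one only checks the overall shape with a simple regex; phase two checks the numeric ranges in plain code, so the caller can be told which phase failed and why:

```python
import re

# Phase 1 pattern: four dot-separated groups of one to three ASCII digits.
_SHAPE_RE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})", re.ASCII)


def check_dotted_quad(text):
    """Return None if `text` is a valid dotted quad, else a reason string."""
    match = _SHAPE_RE.fullmatch(text)
    if match is None:
        return "not four dot-separated groups of 1-3 digits"

    # Phase 2: range checks on the captured pieces, done in ordinary code.
    for position, group in enumerate(match.groups(), start=1):
        if int(group) > 255:
            return f"octet {position} is out of range: {group}"
    return None
```

A single regex doing both jobs would only ever answer "no match", while this version can say that the third octet of `10.0.300.1` is the problem.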

Another route is to use LL or LR parsing. Some languages are not expressible as regular expressions probably even with perl's non-fsm extensions.

fuzzy-waffle
The literature on your statement will require some study.
ojblass
Being able to match at a finer grain level is essential.
ojblass
At this point Perl can parse just about anything with its regexes (they are now Turing Machine compatible), but they are not the best language for parsing complex structures. Parse::RecDescent is much better.
Chas. Owens
Parse::RecDescent is some pretty amazing/insane perl code. I have not had much success with it performance wise. My favorite is antlr which uses a variable look ahead using LL(k) instead of LL(1).
fuzzy-waffle
+2  A: 

Wow, that is ugly. It looks like it should work, modulo an unavoidable bug dealing with 00 as a two digit year (it should be a leap year one quarter of the time, but without the century you have no way of knowing what it should be). There is a lot of redundancy that should probably be factored out into sub-regexes and I would create three sub-regexes for the three main cases (that is my next project tonight). I also used a different character for the delimiter to avoid having to escape forward slashes, changed the single character alternations into character classes (which happily lets us avoid having to escape period), and changed \d to [0-9] since the former matches any digit character (including U+1815 MONGOLIAN DIGIT FIVE: ᠕) in Perl 5.8 and 5.10.

Warning, untested code:

#!/usr/bin/perl

use strict;
use warnings;

my $match_date = qr{
    #match 29th - 31st of all months but 2 for the years 1600 - 9999
    #with optionally leaving off the first two digits of the year
    ^
    (?: 
     #match the 31st of 1, 3, 5, 7, 8, 10, and 12
     (?: (?: 0? [13578] | 1[02] ) ([/.-]) 31) \1
     |
     #or match the 29th and 30th of all months but 2
     (?: (?: 0? [13-9] | 1[0-2] ) ([/.-]) (?:29|30) \2)
    )
    (?:
     (?:                      #optionally match the century
      1[6-9] |         #16 - 19
      [2-9][0-9]       #20 - 99
     )?
     [0-9]{2}                 #match the decade
    )
    $
    |
    #or match 29 for 2 for leap years
    ^
    (?:
    #FIXME: 00 is treated as a non leap year 
    #even though 2000, 2400, etc are leap years
     0?2                      #month 2
     ([/.-])                  #separator
     29                       #29th
     \3                       #separator from before
     (?:                      #leap years
      (?:
       #match rule 1 (div 4) minus rule 2 (div 100)
       (?: #match any century
        1[6-9] |
        [2-9][0-9]
       )?
       (?: #match decades divisible by 4 but not 100
        0[48]       | 
        [2468][048] |
        [13579][26]
       )
       |
       #or match rule 3 (div 400)
       (?:
        (?: #match centuries that are divisible by 4
         16          | 
         [2468][048] |
         [3579][26]
        )
        00
       )
      )
     )
    )
    $
    |
    #or match 1st through 28th for all months between 1600 and 9999
    ^
    (?: (?: 0?[1-9]) | (?:1[0-2] ) ) #all months
    ([/.-])                          #separator
    (?: 
     0?[1-9] |                #1st -  9th  or
     1[0-9]  |                #10th - 19th or
     2[0-8]                   #20th - 28th
    )
    \4                               #separator from before
    (?:                              
     (?:                      #optionally match the century
      1[6-9] |         #16 - 19
      [2-9][0-9]       #20 - 99
     )?
     [0-9]{2}                 #match the decade
    )
    $
}x;
Chas. Owens
Even in this form I would still not feel comfortable with the code.
ojblass
When I was a child, my mother would tell me, "take from the edge, and blow". What goes for soup goes double for regexes. Take them a bit of a time, slow down, and savor the pieces. Before long, the bowl will be empty. ;-)
John Saunders
No, I wouldn't use it either. 70 lines (3 pages) of code that could be replaced by a single call to a library function? No thank you. I just wanted to see how it worked. And now I am obsessed with making it simpler.
Chas. Owens
+3  A: 

Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. — Jamie Zawinski in comp.lang.emacs.

Keep the regular expressions as simple as they can possibly be (KISS). In your date example, I'd likely use one regular expression for each date-type.

Or even better, replace it with a library (e.g. a date-parsing library).

I'd also take steps to ensure that the input source had some restrictions (i.e. only one type of date-strings, ideally ISO-8601).
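As a sketch of what that looks like when the input really is restricted to ISO 8601 calendar dates, Python's standard library can do the whole job. The wrapper name `parse_iso_date` is invented for illustration:

```python
from datetime import date


def parse_iso_date(text):
    """Return a date for 'YYYY-MM-DD' input, or None if it is not a real date."""
    try:
        return date.fromisoformat(text)
    except ValueError:
        # Raised for malformed text and for impossible dates such as Feb 30.
        return None
```

Leap-year handling comes for free: `2000-02-29` parses, while `2001-02-29` is rejected.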

Also,

  • One thing at a time (with the possible exception of extracting values)
  • Advanced constructs are ok if used correctly (as in simplifying the expression and hence reducing maintenance)

EDIT:

"advanced constructs lead to maintenance issues"

My original point was that if used correctly it should lead to simpler expressions, not more difficult ones. Simpler expressions should reduce maintenance.

I've updated the text above to say as much.

I would point out that regular expressions hardly qualify as advanced constructs in and of themselves. Not being familiar with a certain construct does not make it an advanced construct, merely an unfamiliar one. Which does not change the fact that regular expressions are powerful, compact and- if used properly- elegant. Much like a scalpel, it lies entirely in the hands of the one who wields it.

One problem becomes two... but I have to argue that advanced constructs lead to maintenance issues...
ojblass
Of all the pieces of code I look at, regular expressions, synchronization, and operator overloading (in that order) are the hardest to trace to root cause when a defect is found. They may not be advanced, but they are trucks with a lot of power for the amount of space they occupy.
ojblass
+2  A: 

Here is the same regex broken down into digestible pieces. In addition to being more readable, some of the sub-regexes can be useful on their own. It is also significantly easier to change the allowed separators.

#!/usr/local/ActivePerl-5.10/bin/perl

use 5.010; #only 5.10 and above
use strict;
use warnings;

my $sep         = qr{ [/.-] }x;               #allowed separators    
my $any_century = qr/ 1[6-9] | [2-9][0-9] /x; #match the century 
my $any_decade  = qr/ [0-9]{2} /x;            #match any decade or 2 digit year
my $any_year    = qr/ $any_century? $any_decade /x; #match a 2 or 4 digit year

#match the 1st through 28th for any month of any year
my $start_of_month = qr/
    (?:                         #match
        0?[1-9] |               #Jan - Sep or
        1[0-2]                  #Oct - Dec
    )
    ($sep)                      #the separator
    (?: 
        0?[1-9] |               # 1st -  9th or
        1[0-9]  |               #10th - 19th or
        2[0-8]                  #20th - 28th
    )
    \g{-1}                      #and the separator again
/x;

#match 29th - 31st for any month but Feb for any year
my $end_of_month = qr/
    (?:
        (?: 0?[13578] | 1[02] ) #match Jan, Mar, May, Jul, Aug, Oct, Dec
        ($sep)                  #the separator
        31                      #the 31st
        \g{-1}                  #and the separator again
        |                       #or
        (?: 0?[13-9] | 1[0-2] ) #match all months but Feb
        ($sep)                  #the separator
        (?:29|30)               #the 29th or the 30th
        \g{-1}                  #and the separator again
    )
/x;

#match any non-leap year date and the first part of Feb in leap years
my $non_leap_year = qr/ (?: $start_of_month | $end_of_month ) $any_year/x;

#match 29th of Feb in leap years
#BUG: 00 is treated as a non leap year
#even though 2000, 2400, etc are leap years
my $feb_in_leap = qr/
    0?2                         #match Feb
    ($sep)                      #the separator
    29                          #the 29th
    \g{-1}                      #the separator again
    (?:
        $any_century?           #any century
        (?:                     #and decades divisible by 4 but not 100
            0[48]       | 
            [2468][048] |
            [13579][26]
        )
        |
        (?:                     #or match centuries that are divisible by 4
            16          | 
            [2468][048] |
            [3579][26]
        )
        00                      
    )
/x;

my $any_date  = qr/$non_leap_year|$feb_in_leap/;
my $only_date = qr/^$any_date$/;

say "test against garbage";
for my $date (qw(022900 foo 1/1/1)) {
    say "\t$date ", $date ~~ $only_date ? "matched" : "didn't match";
}
say '';

#comprehensive test

my @code = qw/good unmatch month day year leap/;
for my $sep (qw( / - . )) {
    say "testing $sep";
    my $i  = 0;
    for my $y ("00" .. "99", 1600 .. 9999) {
        say "\t", int $i/8500*100, "% done" if $i++ and not $i % 850;
        for my $m ("00" .. "09", 0 .. 13) {
            for my $d ("00" .. "09", 1 .. 31) {
                my $date = join $sep, $m, $d, $y;
                my $re   = $date ~~ $only_date || 0;
                my $code = not_valid($date);
                unless ($re == !$code) {
                    die "error $date re $re code $code[$code]\n"
                }
            }
        }
    }
}

sub not_valid {
    state $end = [undef, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31];
    my $date      = shift;
    my ($m,$d,$y) = $date =~ m{([0-9]+)[-./]([0-9]+)[-./]([0-9]+)};
    return 1 unless defined $m; #if $m is set, the rest will be too
    #components are in roughly the right ranges
    return 2 unless $m >= 1 and $m <= 12;
    return 3 unless $d >= 1 and $d <= $end->[$m];
    return 4 unless ($y >= 0 and $y <= 99) or ($y >= 1600 and $y <= 9999);
    #handle the non leap year case
    return 5 if $m == 2 and $d == 29 and not leap_year($y);

    return 0;
}

sub leap_year {
    my $y    = shift;
    $y = "19$y" if $y < 1600;
    return 1 if 0 == $y % 4 and 0 != $y % 100 or 0 == $y % 400;
    return 0;
}
Chas. Owens
A: 

I posted a question recently about commenting regexes with embedded comments. There were useful answers, particularly one from @mikej.

See the post by Martin Fowler on ComposedRegex for some more ideas on improving regexp readability. In summary, he advocates breaking down a complex regexp into smaller parts which can be given meaningful variable names. e.g.
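A hypothetical sketch of that composition idea in Python. This is not Fowler's own example; the names `SEP`, `MONTH`, `DAY`, `YEAR`, and `MDY_SHAPE` are invented for illustration. It only checks the overall shape of an m/d/y date with a four-digit year from 1600 to 9999, not whether the day exists in that month:

```python
import re

# Small, individually readable pieces...
SEP = r"[/.-]"
MONTH = r"(?:0?[1-9]|1[0-2])"
DAY = r"(?:0?[1-9]|[12][0-9]|3[01])"
YEAR = r"(?:1[6-9]|[2-9][0-9])[0-9]{2}"

# ...composed into the full pattern. (?P=sep) requires the second
# separator to be the same character as the first.
MDY_SHAPE = re.compile(rf"{MONTH}(?P<sep>{SEP}){DAY}(?P=sep){YEAR}")
```

Used with `MDY_SHAPE.fullmatch(text)`, this accepts `11-30-2001` and rejects `11-30/2001`. Changing the allowed separators means editing one short line.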

peter.murray.rust