I was reading an article put together by Martin Fowler regarding Composed Regular Expressions. This is where you might take code such as this:

const string pattern = @"^score\s+(\d+)\s+for\s+(\d+)\s+nights?\s+at\s+(.*)";

And break it out into something more like this:

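protected override string GetPattern() {
  // Whitespace and # comments inside the pattern are only ignored when the
  // regex is constructed with RegexOptions.IgnorePatternWhitespace.
  const string pattern =
    @"^score
    \s+
    (\d+)          # points
    \s+
    for
    \s+
    (\d+)          # number of nights
    \s+
    night
    s?             # optional plural
    \s+
    at
    \s+
    (.*)           # hotel name
    ";

  return pattern;
}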

Or this:

const string scoreKeyword = @"^score\s+";
const string numberOfPoints = @"(\d+)";
const string forKeyword = @"\s+for\s+";
const string numberOfNights = @"(\d+)";
const string nightsAtKeyword = @"\s+nights?\s+at\s+";
const string hotelName = @"(.*)";

const string pattern =  scoreKeyword + numberOfPoints +
  forKeyword + numberOfNights + nightsAtKeyword + hotelName;

Or even this:

const string space = @"\s+";
const string start = "^";
const string numberOfPoints = @"(\d+)";
const string numberOfNights = @"(\d+)";
const string nightsAtKeyword = @"nights?\s+at";
const string hotelName = @"(.*)";

const string pattern =  start + "score" + space + numberOfPoints + space +
  "for" + space + numberOfNights + space + nightsAtKeyword + 
   space + hotelName;
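
To make the comparison concrete, here is a minimal sketch of the same composition in Python rather than C#. The fragment names mirror the constants above; the `parse_score` helper and the sample input are hypothetical and only there for illustration. It also checks that the composed string is character-for-character identical to the original one-line pattern, so the split changes readability only, not behaviour.

```python
import re

# Named fragments, mirroring the constants in the snippets above.
SPACE = r"\s+"
NUMBER_OF_POINTS = r"(\d+)"
NUMBER_OF_NIGHTS = r"(\d+)"
NIGHTS_AT_KEYWORD = r"nights?\s+at"
HOTEL_NAME = r"(.*)"

# Plain string concatenation: the regex engine only ever sees the joined result.
composed = (
    "^" + "score" + SPACE + NUMBER_OF_POINTS + SPACE
    + "for" + SPACE + NUMBER_OF_NIGHTS + SPACE
    + NIGHTS_AT_KEYWORD + SPACE + HOTEL_NAME
)

# The original one-line pattern from the top of the question.
original = r"^score\s+(\d+)\s+for\s+(\d+)\s+nights?\s+at\s+(.*)"


def parse_score(line):
    """Hypothetical helper: return (points, nights, hotel), or None on no match."""
    match = re.match(composed, line)
    if match is None:
        return None
    points, nights, hotel = match.groups()
    return int(points), int(nights), hotel
```

Because the pieces are ordinary strings, a fragment such as `SPACE` can be changed in one place, but nothing prevents a fragment from being reused where it does not belong.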

This sounds easy enough to do and might have some benefits. I am by no means a "regex guy", and I frequently find myself searching the net for the regex I need rather than taking the time to compose it myself. Having found just the right regex for a given problem, I copy and paste it in and test it to make sure it does as advertised. I might then drop in a comment describing what it does. But then I am off to bigger and better things in life.

I am wondering whether breaking a regex pattern up in the way described in Martin Fowler's article will actually make things easier to understand than a comment would. At the end of the day you still have an ugly regex in your code, only now it is in many pieces. If you ever need to extend that expression, how does this really help you understand what the regex is doing?

I know all the die-hard Perl guys out there love their regex patterns. But for those who don't deal with a regex more than once every other project: do you find a regex pattern broken into smaller bits more or less readable? Has anyone employed this concept in a project and found it useful, or not useful?

+2  A: 

Yes, absolutely. Regexes are powerful, but their terse syntax makes them extremely unreadable. When I read a comment such as "this matches a URI", it doesn't actually help me figure out how the regex does that, or where I should look to (for example) fix a bug where it doesn't match some obscure corner case in the query string properly. Regex is code; document it as you'd document a function. If it's short and (reasonably) clear, a single comment for the entire regex is fine. If it's complicated, clearly highlight and comment individual parts. If it's really complex, split it into several regexes.

Pavel Minaev
A: 

It is fairly easy to read if your regex flavor supports extended syntax.

/^
  score   \s+ (\d+) \s+
  for     \s+ (\d+) \s+
  nights? \s+  at   \s+ (.*)
/x
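
For readers outside Perl, a rough equivalent of the `/x` layout can be sketched in Python with the `re.VERBOSE` flag, which likewise ignores whitespace in the pattern and allows `#` comments. The `SCORE_PATTERN` name and the sample input are assumptions made for this example.

```python
import re

# re.VERBOSE plays the role of Perl's /x: pattern whitespace is ignored
# and everything from # to the end of the line is a comment.
SCORE_PATTERN = re.compile(
    r"""
    ^score   \s+ (\d+) \s+    # points
    for      \s+ (\d+) \s+    # number of nights
    nights?  \s+ at    \s+    # "night" with optional plural
    (.*)                      # hotel name
    """,
    re.VERBOSE,
)

match = SCORE_PATTERN.match("score 400 for 2 nights at Minas Tirith Airport")
groups = match.groups() if match else None
```

Without the flag the same string would be treated literally, spaces and comments included, and would not match the sample input.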

I personally prefer Perl 6-style regexes. I think they're easier to read.

rule pattern{
  score        $<score>= [ <.digits>+ ]
  for          $<nights>=[ <.digits>+ ]
  night[s]? at $<hotel>= [ .+ ]
}

After you perform a match against that rule, $/ is associated with the matched text.

So something like this:

say "Hotel $/<hotel>";
say $/.perl;

Would output something like this:

Hotel name of hotel
{
  'hotel'  => 'name of hotel',
  'nights' => 5,
  'score'  => 8
}
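
Named captures are not specific to Perl 6. As a hedged sketch, the same idea in Python uses `(?P<name>...)` groups and `groupdict()`. The input line below is an assumption chosen to line up with the output shown above, and unlike the Perl 6 output the captured values come back as strings.

```python
import re

# Each (?P<name>...) group is retrievable by name after a successful match.
NAMED_PATTERN = re.compile(
    r"^score\s+(?P<score>\d+)"
    r"\s+for\s+(?P<nights>\d+)"
    r"\s+nights?\s+at\s+(?P<hotel>.+)"
)

match = NAMED_PATTERN.match("score 8 for 5 nights at name of hotel")
captures = match.groupdict() if match else {}
```

Here `captures["hotel"]` would hold the hotel name, which removes the need to remember which numbered group is which.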

Brad Gilbert
A: 

I deal with this in PHP by using associative arrays and strtr, PHP's version of the tr function (I assume a similar data structure and function exist in any language).

The array looks like this:

$mappings = array ( 
  'a' => '[a-z0-9]',
  'd' => '[0-9]', 
  's' => '\s+', //and so on 
);

Then when I put them to use, it's just a matter of merging with strtr. Mapped characters get converted, and unmapped characters fall through:

$regexp = strtr($simplified_string, $mappings);
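
As a rough illustration of the same trick outside PHP, here is a minimal Python sketch built on `str.translate`. The `expand` helper and the sample shorthand `"ddsa"` are hypothetical. Like `strtr` with an array, the translation is a single pass, so the `a` inside `[a-z0-9]` is not expanded again; unlike `strtr`, `str.translate` only supports single-character keys.

```python
import re

# Shorthand tokens and the regex fragments they stand for (from the array above).
MAPPINGS = {
    "a": "[a-z0-9]",
    "d": "[0-9]",
    "s": r"\s+",
}


def expand(simplified):
    """Hypothetical helper: replace each mapped character with its fragment.

    Unmapped characters fall through unchanged, and replacement text is
    never re-scanned.
    """
    return simplified.translate(str.maketrans(MAPPINGS))


regexp = expand("ddsa")  # two digits, whitespace, one alphanumeric
is_match = re.fullmatch(regexp, "42 x") is not None
```

Note that every literal `a`, `d`, or `s` in the shorthand gets expanded, so this scheme cannot express those letters as themselves without an extra escaping convention.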

Bear in mind that this approach can just as easily overcomplicate things as simplify them. You're still writing out patterns; you've just abstracted one pattern into another. Nevertheless, these poor-man's character classes can be useful when outsourcing regexps to devs or spec providers who don't speak the language.

rooskie