views:

714

answers:

16

I've just seen a huge regex for Java that made me think a little about maintainability of regular expressions in general. I believe that most people - except some badass perl mongers - would agree that regular expressions are hardly maintainable.

I was thinking about how this situation could be fixed. So far, the most promising idea I have is using a fluent interface. To give an example, instead of:

Pattern pattern = Pattern.compile("a*|b{2,5}");

one could write something like this

import static util.PatternBuilder.*

Pattern pattern = string("a").anyTimes().or().string("b").times(2,5).compile();

Pattern alternative = 
  or(
    string("a").anyTimes(),
    string("b").times(2,5)
  )
  .compile();

In this very short example, the common way of creating regular expression is still quite readable to any mediocre talented developer. However, think about those spooky expressions that fill two or more lines with 80 characters each. Sure, the (verbose) fluent interface would require several lines instead of only two, but I'm sure it would be much more readable (hence maintainable).

Now my questions:

  1. Do you know of any similar approach to regular expressions?

  2. Do you agree that this approach could be better than using simple strings?

  3. How would you design the API?

  4. Would you use such a neat utility in your projects?

  5. Do you think this would be fun implementing? ;)

EDIT: Imagine that there could be methods that are on a higher level than simple constructs we all no from regex, e.g.

// matches [email protected] - think of it as reusable expressions
Pattern p = string{"a").anyTimes().string("b@").domain().compile();

EDIT - short summary of comments:

It's interesting to read that most people think that regular expressions are here to stay - although it takes tools to read them and smart guys to think of ways to make them maintainable. While I'm not sure that a fluent interface is the best way to go, I'm sure that some smart engineers - we? ;) - should spend some time to make regular expressions a thing of the past - it's enough that they've been with us for 50 years know, don't you think?

OPEN BOUNTY

The bounty will be awarded to the best idea (no code required) for a new approach to regular expressions.

EDIT - A NICE EXAMPLE:

here's the kind of pattern I'm talking about - extra kudos to the first one who's able to translate it - RegexBuddies allowed (it's from an Apache project btw) extra kudos awarded to chii and mez: it's a RFC compliant email address validation pattern - although its RFC822 (see ex-parrot.com), not 5322 - not sure if there is a difference though - if yes, I'll award next extra kudos for a patch to fit 5322 ;)

private static final String pattern = "(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:"
    + "\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:("
    + "?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ "
    + "\\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\0"
    + "31]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\"
    + "](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+"
    + "(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:"
    + "(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)"
    + "?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\"
    + "r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?["
    + " \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)"
    + "?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t]"
    + ")*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?["
    + " \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*"
    + ")(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)"
    + "*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+"
    + "|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r"
    + "\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:"
    + "\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t"
    + "]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031"
    + "]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\]("
    + "?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?"
    + ":(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?"
    + ":\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?"
    + ":(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?"
    + "[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*:(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|"
    + "\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>"
    + "@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\""
    + "(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?"
    + ":[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\["
    + "\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-"
    + "\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|("
    + "?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;"
    + ":\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[(["
    + "^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\""
    + ".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\"
    + "]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\"
    + "[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\"
    + "r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]"
    + "|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\0"
    + "00-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\"
    + ".|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,"
    + ";:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?"
    + ":[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:["
    + "^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]"
    + "]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)(?:,\\s*("
    + "?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:("
    + "?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=["
    + "\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t"
    + "])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t"
    + "])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?"
    + ":\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|"
    + "\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:"
    + "[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\"
    + "]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)"
    + "?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\""
    + "()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)"
    + "?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>"
    + "@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?["
    + " \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,"
    + ";:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:"
    + "\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\["
    + "\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])"
    + "*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])"
    + "+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\"
    + ".(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:("
    + "?:\\r\\n)?[ \\t])*))*)?;\\s*)";
+6  A: 

There's a fluent regexps library for .NET.

elder_george
thanks, looks promising
sfussenegger
+1. I wonder, though if it wouldn't get unreadable when you start getting into capturing groups, etc.
TrueWill
Oh lordy that code makes my eyes water. :(
Juliet
+1  A: 

Do you know of any similar approach to regular expressions?

No, except for the previous answer

Do you agree that this approach could be better than using simple strings?

Sort of - I think instead of single characters to represent constructs, we could use more descriptive markup, but I doubt it would make complex logic any clearer.

How would you design the API?

Translate each construct into a method name, and allow nested function calls so that it's very easy to take a string and substitute method names into it.

I think the majority of the value will be in defining a robust library of utility functions, like matching "emails", "phone numbers", "lines that don't contain X", etc. that can be customized.

Would you use such a neat utility in your projects?

Maybe - but probably only for the longer ones, where it would be easier to debug function calls than debugging string editing, or where there is a nice utility function ready to use.

Do you think this would be fun implementing? ;)

Of course!

Jeff Meatball Yang
+16  A: 

It might be slightly easier for someone without any regex experience, but after someone learns your system, he still won't be able to read a normal regex elsewhere.

Also, I think your version is harder to read for a regex expert.

I would recommend annotating the regex like this:

Pattern pattern = Pattern.compile(
  "a*     # Find 0 or more a        \n" +
  "|      # ... or ...              \n" +
  "b{2,5} # Find between 2 and 5 b  \n",
Pattern.COMMENTS);

This is easy to read for any experience level, and for the inexperienced, it teaches regex at the same time. Also, the comments can be tailored to the situation to explain the business rules behind the regex rather than just the structure.

Also, a tool like RegexBuddy can take your regex and translate it into:

Match either the regular expression below (attempting the next alternative only if this one fails) «a*»
   Match the character "a" literally «a*»
      Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Or match regular expression number 2 below (the entire match attempt fails if this one fails to match) «b{2,5}»
   Match the character "b" literally «b{2,5}»
      Between 2 and 5 times, as many times as possible, giving back as needed (greedy) «{2,5}»
Jeremy Stein
Good point. However, I'd love to consider regular expressions a thing of the past - so no need to learn it or teach anybody about it. Just make it obsolete ... please :)Your comment inspired me to add an "EDIT:". Wouldn't it be nice to read expressions not construct by construct but semantically. Normally you don't care as much about the used constructs as you do about the semantics of a regex, right?
sfussenegger
+1 for RegexBuddy. The ability to turn an expression into English is very helpful.
TrueWill
No chance of regexes becoming a thing of the past any time soon. They are far too useful for text parsing and there is no good alternative.
Chris Nava
I think that's what lots of devs thought of other things just before they became a thing of the past ;) However, you're right that there's no good alternative yet (at least no widely known one). Though that's no reason to except it - to me it's a reason to question it even more :)
sfussenegger
Also, keep in mind that regular expressions can be combined. The example of the RFC822 email address regular expression is likely a case in point. That kind of regexp is generated programmatically from smaller pieces. The smaller pieces are intelligible, but the whole may not be.
Bob Aman
+12  A: 

Martin Fowler suggest another strategy. Namely taking meaningful parts of the regex and replace them by variables. He uses following example:

 "^score\s+(\d+)\s+for\s+(\d+)\s+nights?\s+at\s+(.*)"

becomes

   String scoreKeyword = "^score\s+";
   String numberOfPoints = "(\d+)";
   String forKeyword = "\s+for\s+";
   String numberOfNights = "(\d+)";
   String nightsAtKeyword = "\s+nights?\s+at\s+";
   String hotelName = "(.*)";

   String pattern = scoreKeyword + numberOfPoints +
      forKeyword + numberOfNights + nightsAtKeyword + hotelName;

Which is much more readable and maintainable.

The goal of the previous example was to parse numberOfPoints, numberOfNights and hotelName out of a List of Strings like:

score 400 for 2 nights at Minas Tirith Airport
Dominik
That's a useful technique for complicated regexps, but I don't necessarily see any value in replacing e.g. `\s+for\s+` with a variable...
harto
+5  A: 

This is an intriguing concept, but as it is presented there are a few flaws.

But first answers to the key questions asked:

Now my questions:

1. Do you know of any similar approach to regular expressions?

None that haven't been mentioned already. And those I found out about by reading the question and answers.

2. Do you agree that this approach could be better than using simple strings?

If it works as advertised, it would definitely makes things much easier to debug.

3. How would you design the API?

See my notes in the next section. I take your examples and the linked .NET library as a starting point.

4. Would you use such a neat utility in your projects?

Undecided. I have no problem working with the current version of cryptic regularly expressions. And I would need a tool to convert existing regular expressions to the fluent language version.

5. Do you think this would be fun implementing? ;)

I'd enjoy working out the higher level how, than writing the actual code. Which explains the wall of text that is this answer.


Here are a couple of problems I noticed, and the way I would deal with it.

Unclear structure.

Your example seems to create a regex by concatenating on to strings. This is not very robust. I believe that these methods should be added to String and Patern/Regex objects, because it will make implementation and code cleaner. Besides it's similar to the way Regular Expressions are classically defined.

Just because I can't see it working any other way, the rest of my annotations to the proposed scheme will assume that all methods act on and return Pattern objects.

Edit: I seem to have used the following conventions throughout. So I've clarified them and move them up here.

  • Instance methods: pattern augmentation. eg: capturing, repetition, look around assertions.

  • Operators: order of operations. alternation, concatenation

  • Constants: character classes, boundaries (inplace of \w, $, \b, etc)

How will capturing/clustering be handled?

Capturing is a huge part of regular expressions.

I see each Pattern object being stored internally as a cluster. (?: pattern) in Perl terms. Allowing for pattern tokens to easily be mixed and mingled without interfering with other pieces.

I expect capturing to be done as an instance method on a Pattern. Taking a variable to store the matching string[s] in.

pattern.capture(variable) would store pattern in variable. In the event that the capture is part of an expression to be matched multiple times, variable should contain an array of strings of all matches for pattern.

Fluent languages can be very ambiguous.

Fluent languages are not well suited to the recursive nature of regular expressions. So consideration needs to be given to the order of operations. Just chaining methods together does not allow for very complex regular expressions. Exactly the situation where such a tool would be useful.

Does

Pattern pattern = string("a").anyTimes().or().string("b").times(2,5).compile();

produce /a*|b{2,5}/ or /(a*|b){2,5}/ ?

How would such a scheme handle nested alternation? Eg: /a*|b(c|d)|e/

I see three ways of handling alternation in regular expressions

  1. As an operator: pattern1 or pattern2 => pattern # /pattern1|pattern2/
  2. As a class method: Pattern.or( pattern1, pattern2[, pattern3]*) => pattern # /pattern1|patern2|patern3|...|/
  3. As a instance method: pattern1.or(pattern2) => pattern # /pattern1|patern2/

I would handle concatenation the same way.

  1. As an operator: pattern1 + pattern2 => pattern # /pattern1pattern2/
  2. As a class method: Pattern.concatenate( pattern1, pattern2[, pattern3]*) => pattern # /pattern1patern2patern3.../
  3. As a instance method: pattern1.then(pattern2) => pattern # /pattern1patern2/

How to extend for commonly used patterns

The proposed scheme used .domain() which seems to be a common regex. To treat user defined patterns as a methods doesn't make it easy to add new patterns. In a language like Java users of the library would have to override the class to add methods for commonly used patterns.

This is addressed by my suggestion to treat every piece as an object. A pattern object can be created for each commonly used regex like matching a domain for example. Given my previous thoughts about capturing it wouldn't too be hard to ensure that capturing works for multiple copies of the same common pattern that contains a captured section.

There should also be constants for patterns matching various character classes.

Zero width look around assertions

Expanding on my thoughts that all pieces should be implicitly clustered. Look around assertions shouldn't be too hard too do with an instance method.

pattern.zeroWidthLookBehind() would produce (?<patten).


Things that still need to be considered.

  • Backreferences: Hopefully not too hard with the named capturing discussed earlier
  • How to actually implement it. I haven't given the internals too much thought. It's where the real magic is going to happen.
  • Translation: There really should be a tool translate to and from classic regexs (say Perl dialect) and the new scheme. Translating from the new scheme could be a part of the package


Putting it all together, my proposed version of a pattern that matches an email address:

Pattern domain_label = LETTER_CHARACTER + (LETTER_CHARACTER or "-" or DIGIT_CHARACTER).anyTimes()
Pattern domain = domain_label + ("." + domain_label).anyTimes()
Pattern pattern = (LETTER_CHARACTER + ALPHANUMERIC_CHARACTER + "@" + domain).compile

In hindsight, my scheme borrows heavily from Martin Fowler's approach in use. Although I didn't intend for things to go this way, it definitely makes using such a system more maintainable. It also addresses a problem or two with Fowler's approach (capturing order).

EmFi
I quickly scanned through your answer (no time do read it thoroughly now) and it sounds nice. One comment though: the second line of your example should contain `domain_label` rather than `domain` (`domain_label + ("." + domain_label).anyTimes()`)
sfussenegger
That's one of the hazards of posting such a response at 5 am. I've corrected it.
EmFi
+1  A: 

In answer to the last part of the question (for Kudos)

private static final String pattern = "(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:"
    + "\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:("
    + "?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ "
    + "\\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\0"
    + "31]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\"
    + "](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+"
    + "(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:"
    + "(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)"
    + "?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\"
    + "r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?["
    + " \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)"
    + "?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t]"
    + ")*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?["
    + " \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*"
    + ")(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t]"
    + ")+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)"
    + "*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+"
    + "|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r"
    + "\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:"
    + "\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t"
    + "]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031"
    + "]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\]("
    + "?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?"
    + ":(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?"
    + ":\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)|(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?"
    + ":(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?"
    + "[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*:(?:(?:\\r\\n)?[ \\t])*(?:(?:(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|"
    + "\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>"
    + "@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\""
    + "(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?"
    + ":[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\["
    + "\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:[^()<>@,;:\\\".\\[\\] \\000-"
    + "\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|("
    + "?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)?[ \\t])*(?:@(?:[^()<>@,;"
    + ":\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[(["
    + "^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\""
    + ".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\"
    + "]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\"
    + "[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\"
    + "r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] "
    + "\\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]"
    + "|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?(?:[^()<>@,;:\\\".\\[\\] \\0"
    + "00-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\"
    + ".|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,"
    + ";:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\"(?"
    + ":[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*))*@(?:(?:\\r\\n)?[ \\t])*"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t])*(?:["
    + "^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]"
    + "]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:(?:\\r\\n)?[ \\t])*)(?:,\\s*("
    + "?:(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:("
    + "?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=["
    + "\\[\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t"
    + "])*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t"
    + "])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?"
    + ":\\.(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|"
    + "\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*|(?:"
    + "[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\"
    + "]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)*\\<(?:(?:\\r\\n)"
    + "?[ \\t])*(?:@(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\""
    + "()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)"
    + "?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>"
    + "@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*(?:,@(?:(?:\\r\\n)?["
    + " \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,"
    + ";:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:\\r\\n)?[ \\t]"
    + ")*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\\"
    + "\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*)*:(?:(?:\\r\\n)?[ \\t])*)?"
    + "(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\[\"()<>@,;:\\\"."
    + "\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])*)(?:\\.(?:(?:"
    + "\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z|(?=[\\["
    + "\"()<>@,;:\\\".\\[\\]]))|\"(?:[^\\\"\\r\\\\]|\\\\.|(?:(?:\\r\\n)?[ \\t]))*\"(?:(?:\\r\\n)?[ \\t])"
    + "*))*@(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])"
    + "+|\\Z|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*)(?:\\"
    + ".(?:(?:\\r\\n)?[ \\t])*(?:[^()<>@,;:\\\".\\[\\] \\000-\\031]+(?:(?:(?:\\r\\n)?[ \\t])+|\\Z"
    + "|(?=[\\[\"()<>@,;:\\\".\\[\\]]))|\\[([^\\[\\]\\r\\\\]|\\\\.)*\\](?:(?:\\r\\n)?[ \\t])*))*\\>(?:("
    + "?:\\r\\n)?[ \\t])*))*)?;\\s*)";

matches RFC compliant email addresses :D

Mez
+1 extra kudos ;) (from my edit to the question: "its RFC822 (see ex-parrot.com), not 5322 - not sure if there is a difference though - if yes, I'll award next extra kudos for a patch to fit 5322 ;)")
sfussenegger
That regular expression appeared in the first edition of Friedl's Mastering Regular Expression book as an example of when regular expressions are not a good idea. An important problem is the matching nested parentheses etc, which can be solved with .NET expressions in a much easier way using balanced groups.
Abel
+1  A: 

A regular expression is a description of a finite state machine. The classic textual representation isn't necessarily bad. It is compact, it is relatively unambigous and it is fairly well adopted.

It MAY be that a better representation would be a state transition diagram, but that would probably be hard to use in source code.

One possibility would be to build it from a variety of container and combiner objects.

Something along the line of the following (turning this from pseudo-code to language of choice is left as an exercise for the eager):

domainlabel = oneormore(characterclass("a-zA-Z0-9-"))
separator = literal(".")
domain = sequence(oneormore(sequence(domainlabel, separator)), domainlabel)
localpart = oneormore(characterclassnot("@"))
emailaddress = sequence(localpart, literal("@"), domain)

Note that the above will incorrectly classify arbritarily many email addresses as being valid that do NOT conform to the grammar they're required to follow, as that grammar requires more than a simple FSM for full parsing. I don't believe it'd misclassify a valid address as invalid, though.

It should correspond to [^@]+@([a-zA-Z0-9-]+.)+.([a-zA-Z0-9-]+)

Vatine
you're right, calling regular expressions "a Bad Thing" was little bit harsh - or let's say "provocative" :) Your suggestion looks quite nice. And your mention of "finite state machine" made me think - probably most suggestions and alternative approaches to regular expressions are trying to generate a (classic) regex string that can be compiled. Would it be better to think of a different way to build that finite state machine without ever building a string representation?
sfussenegger
My example is a mix of a "structure-based" and "text-based" (you want the ability to express self-matching literals using strings, I think).There are multiple possible ways of expressing FSMs, in various ways. Though I think you'd actually do better with a good notation for Augmented Transition Networks, because they're capable of more complex matches than regular expressions, while still being (relatively) easy to reason about.
Vatine
Sounds really interesting. I think I'd enjoy hearing you talk about the topic :) But couldn't you imagine a decent API rather than "a good notation"?
sfussenegger
A good notation should suggest a good API (and vice versa).
Vatine
+3  A: 

Short answer: I have seen it approached from a linting and compiling angle, which I think is something to consider for this.

Long answer: The company I work for makes hardware-based regular expression engines for enterprise content filtering applications. Think running anti-virus or firewall applications at 20GB/sec speeds right in the network routers rather than taking up valuable server or processor cycles. Most anti-virus, anti-spam or firewall apps are a bunch of regex expressions at the core.

Anyway, the way the regex's are written has a huge impact on the performance of the scanning. You can write regexs in several different ways to do the same thing, and some will have drastically faster performance. We've written compilers and linters for our customers to help them maintain and tune their expressions.

Back to the OP's question, rather than defining a whole new syntax, I would write a linter (sorry, ours is proprietary) cut and paste regex's into that will break down legacy regex's and output "fluent english" for someone to understand better. I'd also add relative performance checks and suggestions for common modifications.

SDGator
A: 

Let's compare: I have worked often with (N)Hibernate ICriteria queries, which can be considered a Fluent mapping to SQL. I was (and still am) enthusiastic about them, but did they make the SQL queries more legible? No, more to the contrary, but another benefit rose: it became much easier to programmatically build statements, to subclass them and create your own abstractions etc.

What I'm getting at is that using a new interface for a given language, if done right, can prove worthwhile, but don't think too highly of it. In many cases it won't become easier to read (nested subtraction character classes, Captures in look-behind, if-branching to name a few advanced concepts that will be hard to combine fluently). But in just as many cases, the benefits of greater flexibility outweigh the added overhead of syntax complexity.

To add to your list of possible alternative approaches and to take this out of the context of only Java, consider the LINQ syntax. Here's what it could look like (a bit contrived) (from, where and select are keywords in LINQ):

// for ^str(aa|bb){3}
from part in mystring
where part startswith "str"
and part hasgroups "aa" or "bb" as first    /* "aa" or "bb" in group 'first' */
and part repeats first 3                    /* repeat group 'first' 3 times */
select part + "extra"                       /* can contain complete statement block */

just a rough idea, I know. The good thing of LINQ is that it is checked by the compiler, a kind-of language in a language. By default, LINQ can also be expressed as fluent chained syntax, which makes it, if well designed, compatible with other OO languages.

Abel
+1  A: 

The short answer, for me, is that, once you get to regular expressions (or other pattern matching that does the same thing) that are long enough to cause a problem... you should probably be considering if they're the right tool for the job in the first place.

Honestly, any fluent interface seems like it would be harder to read than a standard regular expression. For really short expressions, the fluent version is verbose, but not too long; it's readable. But so is the regular expression for something that long.

For a medium sized regular expression, a fluent interface becomes unwieldy; long enough that it's hard, if not impossible, to read.

For a long regular expression (ie, the email address one), where the regular expression is actually hard (if not impossible) to read, the fluent version became impossible to read 10 pages ago.

RHSeeger
A: 

4. Would you use such a neat utility in your projects?

I would most likely not. I think this is a case of using the right tool for the job. There are some great answers here such as: 1579202 from Jeremy Stein. I have recently "gotten" regular expressions on a recent project and found them to be very useful just as they are, and when properly commented they are understandable if you know the syntax.

I think the "knowing the syntax" part is what turns people off to Regular Expressions, to those who don't understand they look arcane and cryptic, but they are a powerful tool and in applied Computer Science (e.g. writing software for a living) I feel like intelligent professionals should and should be able to learn to use them and us them appropriately.

As they say "With great power comes great responsibility." I have seen people use regular expressions everywhere for everything, but used judiciously by someone who has taken the time to learn the syntax thoroughly, they are incredibly helpful; to me, adding another layer would in a way defeat their purpose, or at a minimum take away their power.

This just my own opinion and I can understand where people are coming from who would desire a framework like this, or who would avoid regular expressions, but I have hard time hearing "Regular Expressions are bad" from those who haven't take the time to learn them and make an informed decision.

bn
Oh yeah? And where do I come from? :) Believe me, I've seen and used regular expressions extensively (with grep being one of my favorite tools I use on a daily basis). However, this question might have been kind of rant thing as I was trying to decipher a huge old regular expression in our codebase. I just wanted to challenge people to question what they are using each and every day - and looking at some great answers I'm very happy about the outcome so far. I'm particularly unhappy with the boringness of Jeremy Stein's answer though ;)
sfussenegger
+2  A: 

How would you design the API?

I would borrow a page from the Hibernate criteria API. Instead of using:

string("a").anyTimes().or().string("b").times(2,5).compile()

Use a pattern like:

Pattern.or(Pattern.anyTimes("a"), Pattern.times("b", 2, 5)).compile()

This notation is a bit more concise, and I feel that it's easier to understand the hierarchy/structure of the pattern. Each method could accept either a string, or a pattern fragment, as the first argument.

Do you know of any similar approach to regular expressions?

Not offhand, no.

Do you agree that this approach could be better than using simple strings?

Yes, absolutetely... if you are using regex's for anything remotely complicated. For very short and simple cases, a string is more convenient.

Would you use such a neat utility in your projects?

Possibly, as it became proven/stable... rolling it into a larger utility project like Apache Commons might be a plus.

Do you think this would be fun implementing? ;)

+1

RMorrisey
A: 

You DO realise that m/a*|b{2,5}/ actually matches nothing right?

Jim Courier
A: 

To make everybody happy (regex masters and fluid interface proponents), make sure the fluid interface can output an appropriate raw regex pattern, and also take a regular regex using a Factory method and generate fluid code for it.

John K
A: 

I say go for it, I'm sure it's fun to implement.

I suggest using a query model (similar to jQuery, django ORM), where each function returns a query object, so you can chain them together.

any("a").some("b").one("@").some(chars).one(".").some(chars) //a*b+@\w+\.\w+

where chars is predefined to fit any character.

or can be achieved by using choices:

any("a").choice("x", "z") // a(x|z)

The argument to each function can be a string or another query. For example, the chars variable mentioned above can be defined as a query:

//this one is ascii only
chars = raw("a-zA-Z0-9")

And so, you can have a "raw" function that accepts a regex string as input if it feels to cumbersome to use the fluent query system.

Actually, these functions can just return raw regex, and chaining them is simply concatenating these raw regex strings.

any("a") ---> "a*"
raw("b+") ----> "b+"
one(".") ---> "\."
choice("a", "b") ----> (a|b)
hasen j
A: 

I am not sure that replacing regexp with a fluent API would bring much.

Note that I am not a regexp wizard (I have to re-read the doc nearly every time I need to create a regexp).

A fluent API would make any medium-complexity regexp (let's say ~50 characters) even more complex than required and not easier to read in the end, although it may improve the creation of a regexp in an IDE, thanks to code completion. But code maintenance generally represents a higher cost than code development.

In fact, I am not even sure it would be possible to have an API smart enough to really provide enough guidance to the developer when creating a new regexp, not talking about ambiguous cases, as mentioned in a previous answer.

You mentioned a regexp example for an RFC. I am 99% sure (there is still 1% hope;-)) that any API would not make that example any simpler, but conversely that would only make it more complex to read! That's a typical example where you don't want to use regexp anyway!

Even regarding regexp creation, due to the problem of ambiguous patterns, it is probable that with a fluent API, you would never come with the right expression the first time, but would have to change several times until you get what you really want.

Make no mistake, I do love fluent interfaces; I have developed some libraries that use them, and I use several 3rd-party libraries based on them (e.g. FEST for Java testing). But I don't think they can be the golden hammer for any problem.

If we consider Java exclusively, I think the main problem with regexps is the necessary escaping of backslashes in Java string constants. That's one point that makes it incredibly difficult to create and understand regexp in Java. Hence, the first step to enhance Java regexp would, for me, be a language change, a la Groovy, where string constants don't need to escape backslashes.

So far that would be my only proposal to improve regexp in Java.

jfpoilpret