views:

122

answers:

4

Hi,

I have a complex regular expression I've built with code. I want to normalize it to the simplest (canonical) form that will be an equivalent regular expression but without the extra brackets and so on.

I want it to be normalized so I can understand if it's correct and find bugs in it.

Here is an example for a regular expression I want to normalize:

^(?:(?:(?:\r\n(?:[ \t]+))*)(<transfer-coding>(?:chunked|(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)(?:(?:;(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)=(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)|(?:"(?:(?:(?:|[^\x00-\x31\x127\"])|(?:\\[\x00-\x127]))*)))))*))))(?:(?:(?:\r\n(?:[ \t]+))*),(?:(?:\r\n(?:[ \t]+))*)(<transfer-coding>(?:chunked|(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)(?:(?:;(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)=(?:(?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)|(?:"(?:(?:(?:|[^\x00-\x31\x127\"])|(?:\\[\x00-\x127]))*)))))*))))*))$

Thank you.

+1  A: 

I am not aware of any tool that can do this. I even strongly doubt there is something like a canonical form for regular expressions - they are complex enough that there are usually several and vastly different solutions.

If this expression is the output of an generator it seems much more promising to me to (unit)test the code generator.

Daniel Brückner
See http://code.msdn.microsoft.com/RegexWorkbench. It's not a tool that shortens regexes, but it does make it easier to understand complicated ones.
Gabe
+8  A: 

I'm with the other answers and comments so far. Even if you could define a reduced form, it's unlikely that the reduced form is going to be any more understandable than this thing, which resembles line noise on a 1200 baud modem.

If you did want to find a canonical form for regular expressions, i'd start by defining precisely what you mean by "canonical form". For example, suppose you have the regular expression [ABCDEF-I]. Is the canonical form (1) [ABCDEF-I], (2) [ABCDEFGHI] or (3) [A-I] ?

That is, for purposes of canonicalization, do you want to (1) ignore this subset of regular expressions for the purposes of canonicalization, (2) eliminate all "-" operators, thereby simplifying the expression, or (3) make it shorter?

The simplest way would be to go through every part of the regular expression specification and work out which subexpressions are logically equivalent to another form, and decide which of the two is "more canonical". Then write a recursive regular expression analyzer that goes through a regular expression and replaces each subexpression with its canonical form. Keep doing that in a loop until you find the "fixed point", the regular expression that doesn't change when you put it in canonical form.

That, however, will not necessarily do what you want. If what you want is to reorganize the regular expression to minimize the complexity of grouping or some such thing then what you might want to do is to canonicalize the regular expression so that it is in a form such that it only has grouping, union and Kleene star operators. Once it is in that form you can easily translate it into a deterministic finite automaton, and once it is in DFA form then you can run a graph simplification algorithm on the DFA to form an equivalent simpler DFA. Then you can turn the resulting simplified DFA back into a regular expression.

Though that would be fascinating, like I said, I don't think it would actually solve your problem. Your problem, as I understand it, is a practical one. You have this mess, and you want to understand that it is right.

I would approach that problem by a completely different tack. If the problem is that the literal string is hard to read, then don't write it as a literal string. I'd start "simplifying" your regular expression by making it read like a programming language instead of reading like line noise:

Func<string, string> group = s=>"(?:"+s+")";
Func<string, string> capture = s=>"("+s+")";
Func<string, string> anynumberof = s=>s+"*";
Func<string, string> oneormoreof = s=>s+"+";
var beginning = "^";
var end = "$";
var newline = @"\r\n";
var tab = @"\t";
var space = " ";
var semi = ";";
var comma = ",";
var equal = "=";
var chunked = "chunked";
var transfer = "<transfer-coding>";
var backslash = @"\\";
var escape = group(backslash + @"[\x00-\x7f]");
var or = "|";
var whitespace = 
    group(
        anynumberof(
            group(
                newline +  
                group(
                    oneormoreof(@"[ \t]")))));
var legalchars = 
    group(
        oneormoreof(@"[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]"));

var re = 
    beginning + 
    group(
        whitespace + 
        capture(
            transfer + 
            group(
                chunked + 
                or + 
                group(
                    legalchars + 
                    group(
                        group(
                            semi + 
                            anynumberof(
                                group(
                                    legalchars + 
                                    equal +

...

Once it looks like that it'll be a lot easier to understand and optimize.

Eric Lippert
This is the coolest thing I've seen in a long time... Then I saw who posted it. Eric, you never cease to amaze me with your insights. Wish I could give a million upvotes for this one.
Moose
There are libraries that do that in a more readable way, for example: http://flimflan.com/blog/ReadableRegularExpressions.aspx
Kobi
What does `noq` mean? Why do `beginning` and `end` have `@` signs in front of their strings, but `newline`, `tab`, and `backslash` don't take advantage of it? And I think `bar` reads better as `or`.
Gabe
Gabe: groupwithnoquestionmark was too long. Because I was typing quickly and didn't bother to edit out the @ signs when I realized I didn't need them. No particular reason. Sure. Remember, this is just a sketch of an idea; my intention was not to develop a fully-fledged fluent model for regular expressions.
Eric Lippert
OK, then allow me to make a slight improvement. I don't want to change somebody else's code if they had a good reason for it being how it is.
Gabe
+1  A: 

I'd just write it in an expanded form:

^
(?:
  (?: (?: \r\n (?:[ \t]+) )* )
  (<transfer-coding>
    (?: chunked
     |  (?:
          (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
          (?:
            (?:
              ;
              (?:
                (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                =
                (?: (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                 |  (?:
                      "
                      (?:
                        (?:
                          (?:
                           |  [^\x00-\x31\x127\"]
                           )
                         | (?:\\[\x00-\x127])
                         )*
                      )
                    )
                )
              )
            )*
          )
        )
    )
  )
  (?:
    (?: (?: \r\n (?:[ \t]+) )* )
    ,
    (?: (?: \r\n (?:[ \t]+) )* )
    (<transfer-coding>
      (?: chunked
       |  (?:
            (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
            (?:
              (?:
                ;
                (?:
                  (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                  =
                  (?: (?:[\x21\x23-\x27\x2A\x2B\x2D\x2E0-9A-Z\x5E\x7A\x7C\x7E-\xFE]+)
                   |  (?:
                        "
                        (?:
                          (?:
                            (?:
                             |  [^\x00-\x31\x127\"]
                             )
                           | (?:\\[\x00-\x127])
                           )*
                        )
                      )
                  )
                )
              )*
            )
          )
      )
    )
  )
)
$

You can quickly locate unnecessary grouping, and locate some errors. Some errors i saw:

  • Missing ? for the named groups. It should be (?<name> ).
  • No closing double quote (").

You can even use the regex in this form. If you supply the flag RegexOptions.IgnorePatternWhitespace when constructing the Regex object, any whitespace or comments (#) in the pattern will be ignored.

MizardX
re named groups: if you're talking about `<transfer-coding>`, that's not a valid group name anyway. .NET group names can only contain letters and digits.
Alan Moore
+3  A: 

I think you're getting ahead of yourself; the problems with that regex are not just cosmetic. Many of the parentheses can simply be dropped, as in (?:[ \t]+), but I suspect some of them are changing the meaning of the regex in ways you didn't intend.

For example, what's (?:|[^\x00-\x31\x127\"]) supposed to mean? With that pipe at the beginning, it's equivalent to [^\x00-\x31\x127\"]??--zero or one, reluctantly, of whatever the character class matches. Is that really what you intended?

The character class itself is highly suspect as well. It's obviously meant to match anything other than an ASCII control character or a quotation mark, but the numbers are decimal where they should be hexadecimal: [^\x00-\x1E\x7F\"]

Alan Moore