views:

769

answers:

4

I'm writing (in C#) a simple parser to process a scripting language that looks a lot like classic C.

On one script file I have, the regular expression that I'm using to recognize /* block comments */ is going into some kind of infinite loop, taking 100% CPU for ages.

The Regex I'm using is this:

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/

Any suggestions on why this might get locked up?

Alternatively, what's another Regex I could use instead?

More information:

  • Working in C# 3.0 targeting .NET 3.5;
  • I'm using the Regex.Match(string,int) method to start matching at a particular index of the string;
  • I've left the program running for over an hour, but the match isn't completed;
  • Options passed to the Regex constructor are RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace;
  • The regex works correctly for 452 of my 453 test files.
+1  A: 

You may want to try the option Singleline rather than Multiline, then you don't need to worry about \r\n. With that enabled the following worked for me with a simple test which included comments that spanned more than one line:

/\*.*?\*/
it depends
But won't the greedy matching of the regex matcher mean that this will span from the /* opening of the first comment */ to the */ closing of the last */ ?
Bevan
The ? makes the .* non-greedy or lazy, as described here: http://www.regular-expressions.info/repeat.html#lazy
Alan Moore
+1  A: 

I think your expression is way too complicated. Applied to a large string, the many alternatives imply a lot of backtracking. I guess this is the source of the performance hit you see.

If the basic assumption is to match everything from the "/*" until the first "*/" is encountered, then one way to do it would be this (as usual, regex is not suited for nested structures, so nesting block comments does not work):

/\*(.(?!\*/))*.?\*/             // run this in single line (dotall) mode

Essentially this says: "/*", followed by anything that itself is not followed by "*/", followed by "*/".

Alternatively, you can use the simpler:

/\*.*?\*/                       // run this in single line (dotall) mode

Non-greedy matching like this has the potential to go wrong in an edge case - currently I can't think of one where this expression might fail, but I'm not entirely sure.

Tomalak
+4  A: 

Some problems I see with your regex:

There's no need for the |[\r\n] sequences in your regex; a negated character class like [^*] matches everything except '*', including line separators. It's only when you're using the '.' (dot) metacharacter that you have to worry about those.

Once you're inside the comment, the only character you have to look for is an asterisk; as long as you don't see one of those, you can gobble up as many characters you want. That means it makes no sense to use [^*] when you can use [^*]+ instead. In fact, you might as well put that in an atomic group -- (?>[^*]+) -- because you'll never have any reason to give up any of those not-asterisks once you've matched them.

Filtering out extraneous junk, the final alternative inside your outermost parens is \*+[^*/] -- that means "one or more asterisks, followed by a character that isn't an asterisk or a slash". That will always match the asterisk at the end of the comment, and it will always have to give it up again because the next character is a slash. In fact, if there are twenty asterisks leading up to the final slash, that part of your regex will match them all, then it will give them all up, one by one. Then the final part -- \*+/ -- will match them "for reals".

For maximum performance, I would use this regex:

/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/

This will match a well-formed comment very quickly, but more importantly, if it starts to match something that isn't a valid comment, it will fail as qucikly as possible.

Alan Moore
Thanks Alan, this looks like what I need - though I need to study it a bit to make sure I understand it! Will report back after I've tried it.
Bevan
Apologies - forgot to report back promptly. Yes, this is what I needed. Thanks for your help.
Bevan
A: 

No no no! Hasn't anyone else read Mastering Regular Expressions (3rd Edition)!? In this classic work, Jeffrey Friedl thoroughly examines this exact problem and uses it as an example (pages 272-276) to illustrate the benefits of his advanced "unrolling-the-loop" efficiency technique. The best solution for most regex engines is like so:

/\*[^*]*\*+(?:[^*/][^*]*\*+)*/

However, if the regex engine is optimized to handle lazy quantifiers (like Perl's is), then the most efficient expression is much simpler (as suggested above):

/\*.*?\*/

(With the equivalent 's' "dot matches all" modifier applied of course.) Note that I don't use .NET so I can't say which version is faster for that engine.

ridgerunner