tags:

views:

633

answers:

4

Hi All,

Let's consider the two following lines in C# (using framework .NET 3.5)

Regex regex = new Regex(@"^((E|e)t )?(M|m)oi (?<NewName>[A-Za-z]\.?\w*((\-|\s)?[A-Za-z]?\w{1,})+)$", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Match m = regex.Match("moi aussi jaimerai etre un ordinateur pour pas m'énnerver ");

(sorry it's a french program :))

When they are executed, the process gets stuck in the regex.Match() method and never steps out. I guess there is some problem with the white space in the regex pattern but what I would like to do is not changing the pattern (actually it is set out of the program by the end users of my tool) but being able to stop the process (with a timeout for instance)

Does someone know if this is well-known problem with the .NET Regular Expression and if there is an easy way to work around it or do I have to thread these lines and abort them if needed (Definetly I wouldn't like to do that)

A: 

In general, regular expressions can take longer than you expect. You should experiment with the regular expression hjusing a tool like Regulator.

John Saunders
in this case, the process time is infinite ! And without any out of memory or stack overflow exception
PierrOz
It's entirely possible that this Regex does not complete.
John Saunders
A: 

The problem is that you have nested "loops" in your Regex, which make it terribly inefficient (so that it basically takes forever due to the complexity of the expression).

If you say what you would like to match, I can try to figure out a more efficient Regex for that.

Lucero
+2  A: 

If I enter the expression in Regexbuddy, it presents following message

The match attempt was aborted early because the regular expression is too complex. The regex engine you plan to use it with may not be able to handle it at all and crash. Look up "catastrophic backtracking" in the help file to learn how to avoid this situation.

Looking up catastrophic backtracking gives the following explanation

Runaway Regular Expressions: Catastrophic Backtracking
Consider the regular expression (x+x+)+y. Before you scream in horror and say this contrived example should be written as xx+y to match exactly the same without those terribly nested quantifiers: just assume that each "x" represents something more complex, with certain strings being matched by both "x". See the section on HTML files below for a real example.

Let's see what happens when you apply this regex to xxxxxxxxxxy. The first x+ will match all 10 x characters. The second x+ fails. The first x+ then backtracks to 9 matches, and the second one picks up the remaining x. The group has now matched once. The group repeats, but fails at the first x+. Since one repetition was sufficient, the group matches. y matches y and an overall match is found. The regex is declared functional, the code is shipped to the customer, and his computer explodes. Almost.

The above regex turns ugly when the y is missing from the subject string. When y fails, the regex engine backtracks. The group has one iteration it can backtrack into. The second x+ matched only one x, so it can't backtrack. But the first x+ can give up one x. The second x+ promptly matches xx. The group again has one iteration, fails the next one, and the y fails. Backtracking again, the second x+ now has one backtracking position, reducing itself to match x. The group tries a second iteration. The first x+ matches but the second is stuck at the end of the string. Backtracking again, the first x+ in the group's first iteration reduces itself to 7 characters. The second x+ matches xxx. Failing y, the second x+ is reduced to xx and then x. Now, the group can match a second iteration, with one x for each x+. But this (7,1),(1,1) combination fails too. So it goes to (6,4) and then (6,2)(1,1) and then (6,1),(2,1) and then (6,1),(1,2) and then I think you start to get the drift.

If you try this regex on a 10x string in RegexBuddy's debugger, it'll take 2558 steps to figure out the final y is missing. For an 11x string, it needs 5118 steps. For 12, it takes 10238 steps. Clearly we have an exponential complexity of O(2^n) here. At 21x the debugger bows out at 2.8 million steps, diagnosing a bad case of catastrophic backtracking.

RegexBuddy is forgiving in that it detects it's going in circles, and aborts the match attempt. Other regex engines (like .NET) will keep going forever, while others will crash with a stack overflow (like Perl, before version 5.10). Stack overflows are particularly nasty on Windows, since they tend to make your application vanish without a trace or explanation. Be very careful if you run a web service that allows users to supply their own regular expressions. People with little regex experience have surprising skill at coming up with exponentially complex regular expressions.

I assume you are going to have to handle it in code. I'd suggest you contact the author of Regexbuddy and ask what is needed to detect this scenario.

Lieven
RegexBuddy uses its own Regex engine and supports all major Regex flavours. My guess is that it precomputes the no. of iterations/steps required and aborts the match when that number exceeds the maximum allowed by the program.
Cerebrus
from http://en.wikipedia.org/wiki/Regular_expression : "(...)The third algorithm is to match the pattern against the input string by backtracking. This algorithm is commonly called NFA, but this terminology can be confusing. Its running time can be exponential, which simple implementations exhibit when matching against expressions like (a|aa)*b that contain both alternation and unbounded quantification and force the algorithm to consider an exponentially increasing number of sub-cases. (...)". It must be part of an explanation...
PierrOz
+1  A: 

I think you should simply launch the Regex match on a separate thread and allow it to be aborted if a certain maximum time limit is reached.

Cerebrus