views:

368

answers:

2

Got a simple task to get a XPath expression and return a prefix that matches the parent of the node that (might be) selected.

Example:

/aaa/bbb       =>   /aaa
/aaa/bbb/ccc   =>   /aaa/bbb
/aaa/bbb/ccc[@x='1' and @y="/aaa[name='z']"] => /aaa/bbb

Because the patterns inside the square brackets might contain brackets within quotes, I decided to try to achieve this with the use of regular expressions. Here's a code snippet:

string input =
    "/aaa/bbb/ccc[@x='1' and @y=\"/aaa[name='z'] \"]";
                                            //  ^-- remove space for no loop
string pattern = @"/[a-zA-Z0-9]+(\[([^]]*(]"")?)+])?$";

System.Text.RegularExpressions.Regex re =
    new System.Text.RegularExpressions.Regex(pattern);
bool ismatch = re.IsMatch(input); // <== Infinite loop in here
// some code based on the match

Because the patterns are rather regular, I looked for '/' followed by indentifier followed by an optional group that matches at the end of the string (....)?$

The code seemd to work but playing with different values for the input string, I found that by simply inserting a space (in the location shown in the comment), the .NET IsMatch function gets into an infinite loop, taking all the CPU it gets.

Now regardless of whether this regular expression pattern is the best one (I had more complex but simplified it to show the problem), this seems to show that using RegEx with anything not trivial may be very risky.

Am I missing something? Is there a way to guard against infinite loops in regular expression matches?

+1  A: 

It shows that using code with anything not trivial can be risky. You created code that can result in an infinite loop, and the RegEx compiler obliged. Nothing new that hasn't been done since the first 20 IF X=0 THEN GOTO 10.

If you're worried about this in a particular edge case, you could spawn a thread for RegEx and then kill it after some reasonable execution time.

richardtallent
I find this answer contra productive. Other RegEx engines I tried didn't get into an infinite loop (try, for example, the online JavaScript RegEx tester at http://www.regular-expressions.info/javascriptexample.html and you'll see it works just fine).This regular expression is simple enough and I do not find it trivial that its _expected_ failure mode (when no match is found) is an infinite loop.The thread idea is not useful either. Should I use this idea anywhere an external RegEx is provided? I don't think so. I think this is probably a bug in the RegEx (or else a big gaping hole).
Dror Harari
+2  A: 

Ok, let's break this down then:

Input: /aaa/bbb/ccc[@x='1' and @y="/aaa[name='z'] "]
Pattern: /[a-zA-Z0-9]+(\[([^]]*(]")?)+])?$

(I assume you meant \" in your C#-escaped string, not ""... translation from VB.NET?)

First, /[a-zA-Z0-9]+ will gobble up through the first square bracket, leaving:

Input: [@x='1' and @y="/aaa[name='z'] "]

The outer group of (\[([^]]*(]"")?)+])?$" should match if there is 0 or 1 instance before the EOL. So let's break inside and see if it matches anything.

The "[" gets gobbled right away, leaving us with:

Input: @x='1' and @y="/aaa[name='z'] "]
Pattern: ([^]]*(]")?)+]

Breaking down the pattern: match 0 or more non-] characters and then match "] 0 or 1 times, and keep doing this until you can't. Then try to find and gobble a ] afterward.

The pattern matches based on [^]]* until it reaches the ].

Since there's a space between ] and ", it can't gobble either of those characters, but the ? after (]") allows it to return true anyway.

Now we've successfully matched ([^]]*(]")?) once, but the + says we should attempt to keep matching it any number of times we can.

This leaves us with:

Input: ] "]

The problem here is that this input can match ([^]]*(]")?) an infinite of times without ever being gobbled up, and "+" will force it to just keep trying.

You're essentially matching "1 or more" situations where you can match "0 or 1" of something followed by "0 or 1" of something else. Since neither of the two subpatterns exists in the remaining input, it keeps matching 0 of [^]]\* and 0 of (]")? in an endless loop.

The input never gets gobbled, and the rest of the pattern after the "+" never gets evaluated.

(Hopefully I got the SO-escape-of-regex-escape right above.)

richardtallent
Well that was productive (to me) - thanks Richard.I conclude that:1. Getting a regex pattern from an external source is dangerous and can easily hose an application2. That the regex in .NET does not detect infinite loops and is also not providing a way to limit processing3. That different regex engine can give different results so even if the syntax is the same, some semantics may be different (portability note)Thanks.
Dror Harari
I think the differences you saw were due to the different dialects of regex, not fancy infinite-loop-detection in other engines.The core problem is wrapping something that can match *empty text* an infinite number of times. Anything that's a variation of (x?)+ or (x?)* could be dangerous given the right input. Refactoring your pattern should allow you to get what you need without creating a potential for an infinite loop.Regardless of the language, the lesson is to always program defensively against arbitrary user input.
richardtallent
The problem here is that this input can match ([^]]*(]")?) an infinite of times without ever being gobbled up, and "+" will force it to just keep trying. WHY ????
MemoryLeak