ansaurus

Question

How to avoid infinite loops in the .NET RegEx class?

Answer 1

+1 A:

It shows that any non-trivial code can be risky. You wrote a pattern that can result in an infinite loop, and the RegEx compiler obliged. That is nothing new: it has been possible ever since the first 20 IF X=0 THEN GOTO 10.

If you're worried about this in a particular edge case, you could spawn a thread for RegEx and then kill it after some reasonable execution time.
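
For readers on .NET Framework 4.5 or later, the Regex class itself accepts a match timeout, which may be simpler than managing a worker thread by hand. A minimal sketch, not part of the original answer: the helper name TryIsMatch and the one-second budget are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

static class SafeRegex
{
    // Hypothetical helper: returns null when the match is abandoned on timeout.
    public static bool? TryIsMatch(string input, string pattern, TimeSpan budget)
    {
        try
        {
            // This constructor overload takes a per-match timeout (.NET 4.5+).
            var regex = new Regex(pattern, RegexOptions.None, budget);
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            // The engine gave up after roughly `budget`.
            return null;
        }
    }

    static void Main()
    {
        // Input and pattern taken from the question discussed in this thread.
        string input = "/aaa/bbb/ccc[@x='1' and @y=\"/aaa[name='z'] \"]";
        string pattern = "/[a-zA-Z0-9]+(\\[([^]]*(]\")?)+])?$";

        bool? result = TryIsMatch(input, pattern, TimeSpan.FromSeconds(1));
        Console.WriteLine(result.HasValue ? result.Value.ToString() : "timed out");
    }
}
```

A null result means the engine abandoned the match, so the caller can reject the pattern or the input instead of hanging.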

richardtallent 2009-07-29 14:32:42

I find this answer counterproductive. Other RegEx engines I tried didn't get into an infinite loop (try, for example, the online JavaScript RegEx tester at http://www.regular-expressions.info/javascriptexample.html and you'll see it works just fine). This regular expression is simple enough, and I do not find it trivial that its _expected_ failure mode (when no match is found) is an infinite loop. The thread idea is not useful either. Should I use it everywhere an external RegEx is provided? I don't think so. I think this is probably a bug in the RegEx class (or else a big gaping hole).

Dror Harari 2009-07-29 19:18:03

Answer 2

+2 A:

Ok, let's break this down then:

Input: /aaa/bbb/ccc[@x='1' and @y="/aaa[name='z'] "]
Pattern: /[a-zA-Z0-9]+(\[([^]]*(]")?)+])?$

(I assume you meant \" in your C#-escaped string, not ""... translation from VB.NET?)

First, /[a-zA-Z0-9]+ will gobble up through the first square bracket, leaving:

Input: [@x='1' and @y="/aaa[name='z'] "]

The outer group, (\[([^]]*(]")?)+])?, followed by $, should match if there is 0 or 1 instance before the EOL. So let's break inside and see if it matches anything.

The "[" gets gobbled right away, leaving us with:

Input: @x='1' and @y="/aaa[name='z'] "]
Pattern: ([^]]*(]")?)+]

Breaking down the pattern: match 0 or more non-] characters and then match ]" 0 or 1 times, and keep doing this until you can't. Then try to find and gobble a ] afterward.

The pattern matches based on [^]]* until it reaches the ].

Since there's a space between ] and ", it can't gobble either of those characters, but the ? after (]") allows it to return true anyway.

Now we've successfully matched ([^]]*(]")?) once, but the + says we should attempt to keep matching it any number of times we can.

This leaves us with:

Input: ] "]

The problem here is that this input can match ([^]]*(]")?) an infinite number of times without ever being gobbled up, and "+" will force it to just keep trying.

You're essentially matching "1 or more" situations where you can match "0 or more" of something followed by "0 or 1" of something else. Since neither of the two subpatterns exists at the start of the remaining input, it keeps matching 0 of [^]]* and 0 of (]")? in an endless loop.

The input never gets gobbled, and the rest of the pattern after the "+" never gets evaluated.
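
The stalled step above can be reproduced in isolation. This is a small sketch (the class and method names are made up for illustration) that runs one pass of the repeated group against the leftover input and reports how much text it consumed:

```csharp
using System;
using System.Text.RegularExpressions;

static class EmptyIterationDemo
{
    // One pass of the repeated group, anchored at the start of the input.
    // In a verbatim string, "" stands for a single double quote.
    public static Match OneIteration(string remaining)
    {
        return Regex.Match(remaining, @"^([^]]*(]"")?)");
    }

    static void Main()
    {
        // The leftover input from the walkthrough: ] "]
        Match m = OneIteration("] \"]");
        // Prints: Success=True, Length=0
        Console.WriteLine("Success={0}, Length={1}", m.Success, m.Length);
    }
}
```

The group reports success while consuming nothing, which is exactly the situation the "+" quantifier is then asked to repeat.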

(Hopefully I got the SO-escape-of-regex-escape right above.)

richardtallent 2009-07-29 21:37:57

Well, that was productive (to me) - thanks, Richard. I conclude that:

1. Getting a regex pattern from an external source is dangerous and can easily hose an application.
2. The regex in .NET does not detect infinite loops and does not provide a way to limit processing.
3. Different regex engines can give different results, so even if the syntax is the same, some semantics may be different (portability note).

Thanks.

Dror Harari 2009-07-29 22:19:19

I think the differences you saw were due to the different dialects of regex, not fancy infinite-loop-detection in other engines.

The core problem is wrapping something that can match *empty text* an infinite number of times. Anything that's a variation of (x?)+ or (x?)* could be dangerous given the right input. Refactoring your pattern should allow you to get what you need without creating a potential for an infinite loop.

Regardless of the language, the lesson is to always program defensively against arbitrary user input.
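
As a rough illustration of such a refactoring, assuming the intent is to accept a bracketed predicate in which a ] is only allowed when it is immediately followed by a double quote, the nested quantifier can be flattened so that every loop step has to consume at least one character. The pattern and the names below are a sketch, not the asker's code:

```csharp
using System;
using System.Text.RegularExpressions;

static class PathPattern
{
    // Rewritten pattern (an assumption about the intent):
    // inside [...], each loop step consumes either one non-] character
    // or the two-character pair ]" so no step can match empty text.
    const string Pattern = @"/[a-zA-Z0-9]+(\[(?:[^]]|]"")*])?$";

    public static bool IsPathMatch(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    static void Main()
    {
        // A simple predicate at the end of the path matches.
        Console.WriteLine(IsPathMatch("/aaa/bbb[@x='1']")); // True

        // The input from the question fails, because of the space in ] "
        Console.WriteLine(IsPathMatch("/aaa/bbb/ccc[@x='1' and @y=\"/aaa[name='z'] \"]")); // False
    }
}
```

Because the two alternatives start with different characters, the engine has only one way to consume any given stretch of text, so a failed match is rejected without retrying many different splits.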

richardtallent 2009-07-30 06:48:30

"The problem here is that this input can match ([^]]*(]")?) an infinite number of times without ever being gobbled up, and "+" will force it to just keep trying." Why?

MemoryLeak 2009-09-29 14:21:52
