ansaurus

Question

Adding a single character to my .NET RegEx causes it to hang.

Answer 1

+6 A:

With RegexOptions.IgnorePatternWhitespace, you're telling the engine to ignore whitespace in your pattern. Thus, when you write Cust No in the pattern, it really means CustNo, which doesn't match the input. This is the cause of the problem.

From the documentation:

By default, white space in a regular expression pattern is significant; it forces the regular expression engine to match a white-space character in the input string. [...]

The RegexOptions.IgnorePatternWhitespace option, or the x inline option, changes this default behavior as follows:

Unescaped white space in the regular expression pattern is ignored. To be part of a regular expression pattern, white-space characters must be escaped (e.g. as \s or "\ ").

So instead of Cust No, in IgnorePatternWhitespace mode, you must write Cust\ No, because otherwise it's interpreted as CustNo.
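The same rule can be tried outside .NET. The sketch below uses Python's re.VERBOSE flag, which also drops unescaped whitespace from the pattern, as a stand-in for RegexOptions.IgnorePatternWhitespace. The input string is made up for illustration and is not the asker's actual text.

```python
import re

# Hypothetical input line, invented for this illustration.
text = "Ref: A12  Cust No: 4711"

# In verbose mode the unescaped space is dropped, so this pattern means "CustNo".
unescaped = re.search(r"Cust No", text, re.VERBOSE)

# Escaping the space keeps it significant.
escaped = re.search(r"Cust\ No", text, re.VERBOSE)

# Without the verbose flag, the plain space is matched literally.
default = re.search(r"Cust No", text)

print(unescaped)        # None
print(escaped.group())  # Cust No
print(default.group())  # Cust No
```

In .NET the equivalent check would pass RegexOptions.IgnorePatternWhitespace to the Regex constructor. Only the flag name differs, the escaping rule is the one quoted from the documentation above.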

polygenelubricants 2010-06-04 13:24:27

Good catch! Thanks

Matt 2010-06-04 13:50:56

Answer 2

+1 A:

polygenelubricants already explained why your regex failed. The reason it hangs is that you're running into catastrophic backtracking. Your regex has many parts that can match the same text in a lot of different ways. If the overall match fails, the regex engine tries every possible combination until it either exhausts them all or aborts with a stack overflow.

For example, in To:\W+(?<custAddr>.*?)\W+ the .*? will gladly match the same characters as \W, and since you're using Singleline, the .*? can also cross over into the No:... part of the input text and keep going from there. I tested your example in RegexBuddy to see what happens if you add the "N" after "Cust": the regex engine aborts after 1,000,000 steps.
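To see the "crossing over" on a small scale, here is a rough sketch in Python. It assumes re.DOTALL as the counterpart of .NET's Singleline, and it uses Python's (?P<name>...) spelling for named groups. The two-record input is invented so that the first No: is not followed by digits.

```python
import re

# Hypothetical two-record input; the first "No:" is followed by "x", not digits.
text = "To: Alice\nNo: x\nTo: Bob\nNo: 42"

pattern = re.compile(
    r"To:\W+(?P<custAddr>.*?)\W+No:\W+(?P<invNo>\d+)",
    re.DOTALL,  # like Singleline: "." also matches newlines
)

match = pattern.search(text)

# The lazy .*? first stops after "Alice", but \d+ then fails on "x".
# The engine backtracks and lets .*? grow across the line breaks until
# a later "No:" is followed by digits.
print(repr(match.group("custAddr")))  # 'Alice\nNo: x\nTo: Bob'
print(match.group("invNo"))           # 42
```

The address group ends up swallowing a whole extra record. On a longer input with several such ambiguous parts, each of these retries multiplies with the others, which is where the step count explodes.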

To avoid this, you need to make the regex more specific, or (this might be the better option in this case) keep the regex engine from backtracking by enclosing parts that have already matched in "atomic groups":

(?>\W+INVOICE\W+)
(?>(?<shopAddr>.*?)\W+To:)
(?>\W+(?<custAddr>.*?)\W+No:)
(?>\W+(?<invNo>\d+).*?Date:)
(?>\W+(?<invDate>[0-9/\ :]+)\W+Ref:)
(?>\W+(?<ref>[\w\ ]*?)\W+Cust)

This allows the regex to fail much faster if the input and the regex happen not to fit together.
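What an atomic group does can be shown with a toy pattern. This is only a sketch in Python, not the invoice regex: Python's re module accepts (?>...) from version 3.11 on, so for older versions the code falls back to the common lookahead-plus-backreference emulation, which is an assumption of this example and not part of the original answer.

```python
import re
import sys

# (?>...) is available in Python's re from 3.11. Older versions get the
# usual emulation: a lookahead captures the text, a backreference consumes it,
# and the engine does not backtrack into the finished lookahead.
if sys.version_info >= (3, 11):
    atomic = re.compile(r"(?>\d+)\d")
else:
    atomic = re.compile(r"(?=(\d+))\1\d")

greedy = re.compile(r"\d+\d")

text = "12345"

# Ordinary quantifier: \d+ takes all five digits, then gives one back
# so that the trailing \d can match.
greedy_match = greedy.search(text)

# Atomic group: once \d+ has taken the digits it keeps them, so the
# trailing \d finds nothing and the attempt fails without retrying.
atomic_match = atomic.search(text)

print(greedy_match.group())  # 12345
print(atomic_match)          # None
```

The atomic version gives up as soon as the part after the group fails, instead of retrying every shorter way to match the group. In the invoice regex above, that is what keeps a failure late in the pattern from reopening every earlier section.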

Tim Pietzcker 2010-06-04 16:39:24

+1. This post really makes me want to grab RegexBuddy so I can do benchmarking on my own.

polygenelubricants 2010-06-04 16:47:01
