tags:

views:

62

answers:

2

Here is the input data:

                                *** INVOICE ***                                

                              THE BIKE SHOP                              
                      1 NEW ROAD, TOWNVILLE,                       
                          SOMEWHERE, UK, AB1 2CD                          
                        TEL 01234-567890  

 To: COUNTER SALE                                   No:  243529 Page: 1

                                                    Date: 04/06/10 12:00

                                                    Ref:    Aiden   

 Cust No: 010000                 

Here is a regex that works (Options: singleline, ignorewhitespace, compiled) - it matches immediately and the groups are properly populated:

\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust 

As soon as I add the 'N' out of Cust No into the rex, parsing the input hangs forever:

\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust N

If I add something like "any character" :

\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust .

It works, but as soon as I add a fixed character, the rex hangs again:

\W+INVOICE\W+
(?<shopAddr>.*?)\W+
To:\W+(?<custAddr>.*?)\W+
No:\W+(?<invNo>\d+).*?
Date:\W+(?<invDate>[0-9/ :]+)\W+
Ref:\W+(?<ref>[\w ]*?)\W+
Cust ..:

Can anyone advise why adding something so trivial would cause it to fall over? Can I enable some kind of tracing to watch the matching activity to see if it is getting stuck in a catastrophic backtrack?

+6  A: 

With RegexOptions.IgnorePatternWhitespace, you're telling the engine to ignore whitespaces in your pattern. Thus, when you write Cust No in the pattern, it really means CustNo, which doesn't match the input. This is the cause of the problem.

From the documentation:

By default, white space in a regular expression pattern is significant; it forces the regular expression engine to match a white-space character in the input string. [...]

The RegexOptions.IgnorePatternWhitespace option, or the x inline option, changes this default behavior as follows:

  • Unescaped white space in the regular expression pattern is ignored. To be part of a regular expression pattern, white-space characters must be escaped (e.g. as \s or "\ ").

So instead of Cust No, in IgnorePatternWhitespace mode, you must write Cust\ No, because otherwise it's interpreted as CustNo.

polygenelubricants
Good catch! Thanks
Matt
+1  A: 

polygenelubricants already explained why your regex failed. The reason it hangs is that you're running into catastrophic backtracking. Your regex has many parts that can match the same text in a lot of different ways. If the overall match fails, the regex engine will try all possible permutations until it either exhausts them all or aborts with a Stack Overflow.

E. g. in To:\W+(?<custAddr>.*?)\W+ the .*? will gladly match the same characters as \W, and since you're using Singleline, the .*? will also cross over into the No:... part of the input text and further and further. In your example, I tested in RegexBuddy what happens if you add the "N" after "Cust" - the regex engine aborts after 1,000,000 steps.

To avoid this, you need to make the regex more specific, or (this might be the better option in this case) keep the regex engine from backtracking by enclosing parts that have already matched in "atomic groups":

(?>\W+INVOICE\W+)
(?>(?<shopAddr>.*?)\W+To:)
(?>\W+(?<custAddr>.*?)\W+No:)
(?>\W+(?<invNo>\d+).*?Date:)
(?>\W+(?<invDate>[0-9/\ :]+)\W+Ref:)
(?>\W+(?<ref>[\w\ ]*?)\W+Cust)

This allows the regex to fail much faster if the input and the regex happen not to fit together.

Tim Pietzcker
+1. This post really makes me want to grab RegexBuddy so I can do benchmarking on my own.
polygenelubricants