views:

233

answers:

2

Im using this regex to get the contents of a tag in a file.

var regex = new RegExp("<tag:main>((?:.|\\s)*)</tag:main>");

This causes the v8 engine to hang indefinitely.

Now, if I use new RegExp("<tag:main>([\s\S]*)</tag:main>"), all is good.

Anyone have an idea why the first one takes too long?

+3  A: 

I presume that it is catastrophically back tracking.

I think that part of the issue may well be that the dot and \s are not mutually exclusive.

If I change your expression to

<tag:main>((?:.|[\r\n])*)</tag:main>

and run it in the Regex Buddy debugger it fails a lot quicker in the event that the test string is not a match.

Martin Smith
.|\s is to match all characters. Because . matches all characters except new line.
Engwan
I don't think it will do that. I pasted your Regex into RegexBuddy and pasted its comment tree into my post.
Martin Smith
You should remove the extra \ before pasting to RegexBuddy. The \\ is used because it is a javascript string passed to RegExp constructor.
Engwan
Whoops! If you make it lazy rather than greedy does this stop the issue?<tag:main>((?:.|\s)*?)</tag:main>
Martin Smith
I have completely rewritten my answer now!
Martin Smith
+4  A: 

This catastrophically backtracks on long sequences of spaces that occur after the last closing </tag:main> tag. Consider the case where the subject string ends with 100 spaces. First it matches them all with the . on the left of the alternation. That fails because there's no closing tag so it tries matching the last character with the \s instead. That fails too so it tries matching the second-to-last space as a \s and the last space as a '.'. That fails (still no closing tag) so it tries the last space as a \s. When that fails it matches the third to last space as a \s and tries all 4 ways to match the last two spaces. When that fails it tries the fourth-to-last space as a \s and all 8 ways on the last 3 spaces. Then 16, 32 etc. The universe ends before it gets to the 100th-to-last space.

Different VMs have different reactions to regexp matches that take forever because of catastrophic backtracking. Some will simply report 'no match'. In V8 it's like writing any other infinite or near-infinite loop.

Using non-greedy * will do what you want (you want to stop at the first </tag:main>, not the last), but will still do catastrophic backtracking for long strings of spaces where the closing sequence is missing.

Making sure the same characters in the inner bracket can't match both sides of the alternation will reduce the problem from an exponential one to one that is linear in the length of the string. Use a character class instead of an alternation or put \n on the right hand side of the alternation bar. \n is disjoint with '.' so if you hit a long sequence of spaces the regexp engine doesn't try all left-right-left etc. combinations before terminating.

Erik Corry
Good explanation. Do you happen to know if dot includes \r as well?
Martin Smith
@Martin: in JavaScript, `.` is equivalent to `[^\r\n\u2028\u2029]`
Alan Moore
@Alan - Thanks!
Martin Smith