views:

424

answers:

4

Hi,

I've got a simple JavaScript regex check (written by other developer) that works perfectly on thousands of different strings. However I've just discovered one particular string value that's causing that regex to take as long as 10min to execute in Firefox/IE which is unacceptable. I've extracted an actual regex call into small code snippet for your convenience:

<html>
  <script>
    function dodo(){
      var mask = /^([\w'#@\-\&\(\)\/.]+[ ]*){1,100}$/;
      var value = "Optometrists Association Australia, Queensland/NT Division";
      mask.exec(value);
    }
  </script>
  <body>
    <input type="button" value="Click" onclick="dodo()">
  </body>
</html>

What is the problem here? If I change value to anything else it works perfectly.

Thank you!

+1  A: 

removing the comma from the string or adding it to the character group makes it execute quickly, but without examples of correct operation or an explanation of what you're trying to achieve I can't say for sure if it works correctly...

Luke Schafer
+5  A: 

You probably meant to have a + after the space group, rather than *. If you replace it back with a +, things go much faster. The * causes the regex evaluator to try a huge number of combinations, all of which fail when they reach the ','. You might want to add a ',' to the first character group too.

Overall, it might look like this:

var mask = /^([\w'#@\-\&\(\)\/.,]+[ ]+){1,100}$/;
Oren Trutner
Was just about to write this, +1
Matt Bridges
Just noticed: While the [ ]+ solves the backtracking issue, it means that all captures have to end with a space.
ojrac
You can add braces around the non-whitespace portion to capture it, and turn the braces already there to a non-capturing group (?:pattern) -- see http://www.regextester.com/jssyntax.html.
Oren Trutner
+5  A: 

You're running in to crazy backtracking, a common feature in regexes that includes something of the form ([characters]+)+ -- it runs great for all sorts of matching patterns, but then you find a string like this, which makes it explode, recursing all over the string. Here's a sketch of what happens.

To start out, your pattern splits the string into groups. I use a | to start instances of your group, the one you repeat {1,100}. > is the end of a group, and ? is the regex parser's "cursor."

|----------->|---------->|-------?
Optometrists Association Australia, Queensland/NT Division

At the ?, your pattern can't match any more symbols or whitespace, so it tries to match $. Since the cursor hasn't reached the end of the line yet, it fails, and the regex parser backtracks:

|----------->|---------->|------?
Optometrists Association Australia, Queensland/NT Division

Once again, it can't find any whitespace, so it terminates the group, and tries to start another one (since there can be up to 100, and we've only used 3 so far).

|----------->|---------->|------|-?
Optometrists Association Australia, Queensland/NT Division

The parser has reached the problematic , again, and it kills that execution tree, causing it to backtrack once more, to the i in Australia. And, just like last time, it tries to start a group:

|----------->|---------->|-----|--?
Optometrists Association Australia, Queensland/NT Division

...anyway, you get the idea. This cycle of fail, backtrack and slice again will effectively freeze up your Regex parser until it's exhausted every permutation and returned false. The key to recognizing and fixing this is to never repeat a repeated group without some form of delimiter at the beginning and/or end. I'd suggest using the word boundary anchor \b, since [ ]+ would require your strings to end in whitespace:

/^(\b[\w'#@\-\&\(\)\/.]+\b[ ]*){1,100}$/

As a side-note, it's hard to tell what your regex is doing without more context, but it seems like you could also just call value.split(' ') to split the string on whitespace characters, and run a simpler regex on all those substrings. It'd eliminate the need for the double regex repeat.

ojrac
Good explanation. Too bad javascript does not support possessive quantifiers. They would come in handy here to kill an awful lot of backtracking.
Geert
+3  A: 

This looks like a poor application for a regex, and a poor regex to boot. The intent seems to be to match a list of between 1 and 100 space-separated "words", I think. Here are the core problems I can see:

  1. The use of "[ ]*" at the end of the word, instead of "[ ]+" means that every byte can potentially be a "word" alone, whether it's bounded by spaces or not. That's a lot of match cases for your engine to keep track of.

  2. You're using capturing parentheses ("(...)") instead non-capturing ones ("(?:...)"). The grouping will be doing yet more bookeeping to save the last word matched for you, which you probably doing need or not.

And some minor issues:

  1. The "[ ]*" expression is redundant. Just use " *" to match zero or more spaces. But you probably want "\s" there, to match whitespace of any type, not just a space.

  2. The expression allows whitespace at the end of the string, but not the beginning. Most applications usually want to tolerate both or neither.

  3. For readability, don't use backslash escaping where it's not needed. Only the "-" in your bracket actually needs it.

  4. What's magic about 100? Do you really want to hard-code that limit?

Finally, why use a regex here at all? Why not simply split() on whitespace into an array of substrings, and then test each resulting word against a simpler expression?

Andy Ross