tags:

views:

90

answers:

2

Hello!

I found some problem while testing my NLP system. I have a java regex "(.*\\.\\s*)*Dendryt.*" and for string "v Table of Contents List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . " it just dont stop computing.

Its clear that this regex complexity is very high, I will try to refactor it. Have you some suggestions for me for a future regex development ???

Thanks.

+6  A: 

You're running into catastrophic backtracking by repeating a group containing repeated quantifiers. The combinatorial explosion that follows will then (given enough input) lead to a (tada!) Stack Overflow.

Simplified, your regex tries to

(.*\.\s*) match any succession of characters including dots and spaces, followed by a dot, followed by zero or more spaces, then

(...)* repeat this any number of times.

Dendryt Only then it tries to match "Dendryt".

Since this fails, the engine backtracks, trying a different permutation. The possibilities are nearly endless...

To illustrate, here's a screenshot of RegexBuddy's regex debugger on a simplified version of your data:

RegexBuddy Screen Shot

The engine gives up after 1 million permutations.

Your regex would be a little better like this (don't forget to escape the backslashes when converting it to a Java string):

(.*)(\.\s*)*+Dendryt

In this case the *+, a so-called possessive quantifier, will refuse to backtrack once it has matched. That way, the regex engine can fail much faster, but it's still bad because (.*) matches anything, even the dots.

([^.]*)(\.\s*)*+Dendryt

is safe, unless your data can contain dots before the "dotted line bit". All in all, please state your requirements a bit more clearly, then a better regex can be built.

Tim Pietzcker
Thanks a lot, it was very helpful! :)
Michal_R
+1  A: 

Try this:

"[^.]*+(?>\\.\\s*)*+Dendryt.*"

[^.]*+ consumes everything before the first dot, and the + makes the * possessive, so the regex will never backtrack past that point.

(?>\\.\\s*) is an atomic group: it matches a dot and any subsequent whitespace as if it were a single unit. If the regex engine has to backtrack to that point, it will then skip right over it to where the group started matching.

But it won't backtrack to that point, because I made the group's quantifier possessive, too. I wanted to illustrate the use of atomic groups, but I could have made the \\s* possessive instead--or both.

Possessive quantifiers and atomic groups disable backtracking completely, but it's not always possible to use them. When you have to allow backtracking, keep it to a minimum; don't let the quantifiers consume more than they have to. And especially, as Tim said, avoid nested quantifiers and quantified subexpressions that can match the same things.

In fact, it's good exercise to avoid the use of .* and .+; it forces you to think about the mechanics of it. If there isn't something specific you want to match, think about what you don't want to match, like when I used [^.]* in place of the first .* in your regex.

Alan Moore
Thanks a lot for help.
Michal_R