views:

82

answers:

2

I have some regexes in a Perl script that are correct but slow. I am considering trying to improve performance by adding extra + operators (ie *+ instead of * and ++ instead of +) to disable backtracking. I tried replacing all of them and the regexes stopped working... so much for the simple solution. How do I know where I can add them where it won't break the regex?

A: 

Hmmm... once I posted the question, looking at the "Related" column led me to this which has some pretty good ideas.... http://www.regular-expressions.info/catastrophic.html

JoelFan
+5  A: 

If the regexes stopped working, you either aren't using a version of perl that supports them, or you actually do need backtracking in those cases.

Identify sections of the regex that won't ever need backtracking (that is, that if asked to match starting at a given point, there will never be more than one length you might want them to match), and surround them with (?> ). This has the same effect as ++/*+ and is supported even pre-5.10.

Note that restricting backtracking is often not "optimization", since it changes what will and will not be matched. The idea is that you use it to better describe what you actually want matched. Borrowing from the article linked in the OP's answer, something like ^(.*?,){11}P (twelfth comma separated field starts P) is not just inefficient, it is incorrect, since backtracking will cause it to actually match even when only a field after the twelfth starts with P. By correcting it to ^(?>.*?,){11}P you are restricting it to actually matching the correct number of leading fields. (In this trivial case, ^([^,]*,){11}P also does the job, but if you add in support for escaped or quoted commas within fields using alternation, (?> becomes the easier choice.)

ysth