ansaurus

Question

Why do these regular expressions execute slowly in Java?

Answer 1

+1 A:

String.matches() compiles the regex every time you call it. Instead, look at the Pattern and Matcher classes, which let you cache precompiled regexes.
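A minimal sketch of that idea, assuming a made-up pattern (the regexes from the question are not shown in this thread, so `[A-Z]{9}` and the class and method names here are purely illustrative):

```java
import java.util.regex.Pattern;

final class PrecompiledRegexExample {
    // Hypothetical pattern for illustration only: exactly nine uppercase letters.
    // It is compiled once, when the class is initialized.
    private static final Pattern NINE_LETTERS = Pattern.compile("[A-Z]{9}");

    // Reuses the compiled Pattern; each call only creates a Matcher.
    static boolean isValid(CharSequence input) {
        return NINE_LETTERS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ABCDEFGHI")); // prints "true"
        System.out.println(isValid("ABC"));       // prints "false"
    }
}
```

This only pays off when the same regex is applied many times; a single call to matches() compiles the pattern once either way.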

Another thing is to use non-capturing groups if you don't need the text that each group matched.

ddimitrov 2010-06-27 14:03:10

I only call matches() once, so that should not be a problem. The regexes perform well with small input, but horrendously slowly with input of more than, say, 200 characters. I was unable to get non-capturing groups to work - can you give an example?

Martin Wiboe 2010-06-27 14:07:10

Switching to non-capturing groups won't give you a factor of 1000 improvement. Still, this is how you do it: put ?: right after the opening parenthesis. Example: (?:\\s?[" + alphabet + "]{9,9})+
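For a self-contained version of that, here is a rough sketch. The alphabet is an assumption (the real one is never shown in the thread), and {9} is used as the shorter equivalent of {9,9}:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonCapturingGroupExample {
    // Hypothetical alphabet; the thread never shows the real one.
    static final String ALPHABET = "A-Z";

    // Same expression twice: once with a capturing group, once with (?: ... ).
    static final Pattern CAPTURING =
            Pattern.compile("(\\s?[" + ALPHABET + "]{9})+");
    static final Pattern NON_CAPTURING =
            Pattern.compile("(?:\\s?[" + ALPHABET + "]{9})+");

    public static void main(String[] args) {
        String input = "ABCDEFGHI JKLMNOPQR";
        Matcher capturing = CAPTURING.matcher(input);
        Matcher nonCapturing = NON_CAPTURING.matcher(input);

        // Both accept the same input; only the number of recorded groups differs.
        System.out.println(capturing.matches() + " groups=" + capturing.groupCount());
        // prints "true groups=1"
        System.out.println(nonCapturing.matches() + " groups=" + nonCapturing.groupCount());
        // prints "true groups=0"
    }
}
```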

ddimitrov 2010-06-27 23:14:13

Answer 2

A:

If you have a number of different regular expression patterns that are being matched against the same input to try to categorize the input, then you are likely to be better off using a lexical analyzer generator like JFlex.

Other Java-based lexical analyzer and parsing tools that are typically used in compiler construction can be found listed here.

Joel Hoff 2010-06-27 14:40:24

Those are useful tools, thanks! But I only have 2 regexes, and this should be very simple - I think using JFlex and the like would be overkill.

Martin Wiboe 2010-06-27 15:31:22

@Martin - It does sound like JFlex or the like would be more than normally needed in this case. However, in looking more closely at the regular expressions you are using, it is possible that your case is exposing some degenerate case in how the Pattern class compiles its analyzer. It may be worth trying JFlex just to see if it can produce a tighter analyzer for this scenario.

Joel Hoff 2010-06-27 16:00:22

Answer 3

+1 A:

This might not explain your particular problem, but I once dived into the JDK's regex implementation and was surprised at how unsophisticated it is. It doesn't really build a state machine that advances on each input char. I assume they have their reasons.

In your case, it is easy to write a parser yourself, by hand. People are afraid to do that: it seems "dumb" to manually code these tiny steps, and people think established libraries must be doing some splendid tricks to outperform home-grown solutions. That's not true. In many cases our needs are rather simple, and it is simpler and faster to DIY.
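As a rough sketch of what such a hand-written check could look like, assume the expression resembles the (?:\s?[alphabet]{9})+ form quoted elsewhere in this thread. The alphabet below is hypothetical and must not contain whitespace, otherwise this single pass would no longer be equivalent to the regex:

```java
final class HandWrittenMatcher {
    // Hypothetical alphabet (the thread does not show the real one).
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final int GROUP_LENGTH = 9;

    // The six characters that \s matches by default in java.util.regex.
    private static boolean isRegexWhitespace(char c) {
        return c == ' ' || c == '\t' || c == '\n'
                || c == 0x0B || c == '\f' || c == '\r';
    }

    // Accepts one or more groups of exactly nine alphabet characters,
    // each optionally preceded by a single whitespace character.
    static boolean matches(String input) {
        int pos = 0;
        int groups = 0;
        while (pos < input.length()) {
            // Optional single whitespace character before the group.
            if (isRegexWhitespace(input.charAt(pos))) {
                pos++;
            }
            // Exactly GROUP_LENGTH alphabet characters must follow.
            if (pos + GROUP_LENGTH > input.length()) {
                return false;
            }
            for (int i = 0; i < GROUP_LENGTH; i++) {
                if (ALPHABET.indexOf(input.charAt(pos + i)) < 0) {
                    return false;
                }
            }
            pos += GROUP_LENGTH;
            groups++;
        }
        return groups > 0; // at least one complete group is required
    }

    public static void main(String[] args) {
        System.out.println(matches("ABCDEFGHI JKLMNOPQR")); // prints "true"
        System.out.println(matches("ABCDEFGH"));            // prints "false"
    }
}
```

Each character is examined once and nothing is ever re-tried, so the running time grows linearly with the input length.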

irreputable 2010-06-28 01:32:08

The Pattern.compile() method does build a state machine out of descendants of the Pattern.Node class. Do you mean that it builds an NFA as opposed to a DFA? That is the design used by most feature-rich regex engines, trading speed for features. Here is an article explaining this and suggesting an alternative: http://weblogs.java.net/blog/2006/03/27/faster-java-regex-package

ddimitrov 2010-06-29 01:51:32
