views:

150

answers:

3

Hi,

I am trying to use regular expressions to determine what format the user have applied when entering input in a textbox.
The regular expressions are as follows:

(\\s?[" + alphabet + "]{9,9})+

To determine whether the input is one or more strings of length 9 in a given alphabet, possibly separated by whitespace.

(>[\\w\\s]+\\n[" + alphabet + "\\s]+)+

To check if the input is in FASTA format

The regular expressions run terribly slow when matching with inputString.matches(regexString). Why is this?

I figured this may be due to Java storing all potential matches (which I don't need at this point), but adding ?: in every parenthesis breaks the regex. How should this be done?

Thank you,

Martin

Edit 1: I was unable to reproduce this issue - it only happens on one computer. This could suggest something wrong with that particular VM setup.
We need something more robust, and so we will be implementing this differently. I have picked Joel's answer as the right one, since I believe that some special case in Pattern may be the cause.

+1  A: 

string.matches() compile the regex every time you do it. Instead, look at the Pattern/Matcher classes, which allow you to cache precompiled regexes.

Another thing is to use non-capturing regex groups if you don't need the result of the matching.

ddimitrov
I only call matches() once, so that should not be a problem. The regexes perform well with small input, but horrendously slowly with input of more that, say, 200 characters.I was unable to get non-capturing groups to work - can you give an example?
Martin Wiboe
Switching to non-capturing groups won't give you a factor of 1000 improvement. Still, this is how you do it - put ?: after the opening parenthesis - example: (?:\\s?[" + alphabet + "]{9,9})+
ddimitrov
A: 

If you have a number of different regular expression patterns that are being matched against the same input to try to categorize the input, then you are likely to be better off using a lexical analyzer generator like JFlex.

Other Java-based lexical analyzer and parsing tools that are typically used in compiler construction can be found listed here.

Joel Hoff
Those are useful tools, thanks! But I only have 2 regexes, and this should be very simple - I think using JFlex and the like would be overkill.
Martin Wiboe
@Martin - It does sound like JFlex or the like would be more than normally needed in this case. However, in looking more closely at the regular expressions you are using, it is possible that your case is exposing some degenerate case in how the Pattern class compiles its analyzer. It may be worth trying JFlex just to see if it can produce a tighter analyzer for this scenario.
Joel Hoff
+1  A: 

this might not explain your particular problem. but once I dived into JDK's regex implementation, and I was surprised at how unsophisticated it is. it doesn't really build a state machine that advances at each input char. I assume they have their reasons.

in your case, it is so easy to write a parse by yourself, by hand. people fear to do that, it seems "dumb" to manually code these tiny steps, and people think established libraries must be doing some splendid tricks to outperform home grown solutions. that's not true. in many cases, our needs are rather simple, and it is simpler and faster to DIY.

irreputable
The Pattern.compile() method does build state machine of the descendends of Pattern.Node class. Do you mean that it builds a NFA automaton as opposed to DFA? That is the design used by most feature-rich regex engines, trading speed for features. Here is an article explaining this and suggesting alternative: http://weblogs.java.net/blog/2006/03/27/faster-java-regex-package
ddimitrov