views:

97

answers:

6

Hi All,

my question is related to Regular Expressions in Java, and in particular, multiple matches for a given search pattern. All of the info i need to get is on 1 line and it contains an alias (e.g. SA) which maps to an IP address. Each one is separated by a comma. I need to extract each one.

SA "239.255.252.1", SB "239.255.252.2", SC "239.255.252.3", SD "239.255.252.4"

My Reg Ex looks like this:

Pattern alias = Pattern.compile("(\\S+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");  
Matcher match = alias.matcher(lineInFile)  
while(match.find()) {  
do something  
}

This works but I'm not totally happy with it because since introducing this small piece of code, my program has slowed down a bit (< 1 sec) but enough to notice a difference.

So my question is, am I going about this in the correct manner? Is there a more efficient or possibly lightweight solution without the need for a while(match) loop? and/or Pattern/Matcher classes?

Many thanks in advance

J

+1  A: 

If the line may not contain anything except that alias definition, then using .match() instead of .find() might speed up the searching on non-matches.

Joachim Sauer
good idea (+1) ...
seanizer
A: 

I'm afraid your code looks pretty efficient already. Here's my version:

Matcher match = Pattern
                .compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"")
                .matcher(lineInFile);  
while(match.find()) {  
    //do something  
}

There are two micro-optimizations:

  1. No need to keep pattern in an extra variable, inlined that
  2. For the alias, search for word characters, not non-space characters

Actually, if you do a lot of processing like this and the pattern never changes, you should keep the compiled pattern in a constant:

private static final Pattern PATTERN = Pattern
            .compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");

Matcher match = PATTERN.matcher(lineInFile);  
while(match.find()) {  
    //do something  
}

Update: I took some time on RegExr to come up with a much more specific pattern, which should only detect valid IP addresses as a bonus. I know it's ugly as hell, but my guess is that it's pretty efficient, as it eliminates most of the backtracking:

([A-Z]+)\s*\"((?:1[0-9]{2}|2(?:(?:5[0-5]|[0-9]{2})|[0-9]{1,2})\.)
{3}(?:1[0-9]{2}|2(?:5[0-5]|[0-9]{2})|[0-9]{1,2}))

(Wrapped for readability, all back-slashes need to be escaped in java, but you can test it on RegExr as it is with the OP's test string)

seanizer
The effect of the first micro-optimization is probably too small to measure. The second one changes the meaning of the regex, and it is not clear that it helps much. However, precompiling and reusing the pattern is definitely worthwhile.
Stephen C
Re the second micro-optimization: looking at the code of the Pattern class, I think this might actually slow pattern matching down!
Stephen C
Yes, That makes sense. Finding the negation of a small character class should be more efficient than finding a lqrge character class like \w. I'll update my answer soon
seanizer
I found out during my tests that \w+ is 15-20% faster than \S+.
splash
Well maybe then I'll leave it in after all :-)
seanizer
A: 

You can improve your regex to: "(\\S{2})\\s+\"((\\d{1,3}\\.){3}\\d{1,3})\"" by specifying an IP address more explicitly.

Try out the performance of using a StringTokenizer. It does not use regular expressions. (If you are concerned about using a legacy class, then take a look at its source and see how it is done.)

StringTokenizer st = new StringTokenizer(lineInFile, " ,\"");
while(st.hasMoreTokens()){
    String key = st.nextToken();
    String ip = st.nextToken();
    System.out.println(key + " ip: " +  ip);
}
dogbane
*StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.* (Source: http://download.oracle.com/javase/6/docs/api/java/util/StringTokenizer.html )
seanizer
That said, Scanner may be a good alternative: http://download-llnw.oracle.com/javase/6/docs/api/java/util/Scanner.html
seanizer
Yes, I know. That's why I put a note in my post. StringTokenizer uses String's `indexOf` and `substring` methods internally so we can see how it works and replicate its functionality in our new code if it is faster than regex.
dogbane
`Scanner` uses regex.
dogbane
*That's why I put a note in my post* I missed that. This must be my blind day, sorry.
seanizer
A: 

I don't know if this will yield a big performance benefit, but you could also first do

string.split(", ") // separate groups

and then

string.split(" ?\"") // separate alias from IP address

on the matches.

Tim Pietzcker
So two regex passes will be faster than one? I doubt it.
seanizer
@seanizer: I'm doubtful, too. I don't use Java, so I can't profile it. But it might be worth a try.
Tim Pietzcker
A: 

Precompiling and reusing the Pattern object is (IMO) likely to be the most effective optimization. Pattern compilation is potentially an expensive step.

Reusing the Matcher instance (e.g. using reset(CharSequence)) might help, but I doubt that it will make much difference.

The regex itself cannot be optimized significantly. One possible speedup would be to replace (\d+\.\d+\.\d+\.\d+) with ([0-9\.]+). This might help because it reduces the number of potential backtrack points ... but you'd need to do some experiments to be sure. And the obvious downside is that it matches character sequences that are not valid IP addresses.

Stephen C
Reusing the matcher sounds like a bad idea, because it would really break things in a multithreaded scenario (unless you introduce object pools and *that* would really be overkill)
seanizer
A: 

If you`re noticing a difference of < 1 sec on that piece of code, then your input string must contain around a million (ot at least some 100k) of entries. I think that's a pretty fair performance and I cannot see how you could significantly optimize that without writing your own specialized parser.

splash