ansaurus

Question

Java Regular Expressions using Pattern and Matcher

Answer 1

+1 A:

If the line may not contain anything except that alias definition, then using .match() instead of .find() might speed up the searching on non-matches.

Joachim Sauer 2010-09-29 09:28:46

good idea (+1) ...

seanizer 2010-09-29 09:38:44

Answer 2

A:

I'm afraid your code looks pretty efficient already. Here's my version:

Matcher match = Pattern
                .compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"")
                .matcher(lineInFile);  
while(match.find()) {  
    //do something  
}

There are two micro-optimizations:

No need to keep pattern in an extra variable, inlined that
For the alias, search for word characters, not non-space characters

Actually, if you do a lot of processing like this and the pattern never changes, you should keep the compiled pattern in a constant:

private static final Pattern PATTERN = Pattern
            .compile("(\\w+)\\s+\"(\\d+\\.\\d+\\.\\d+\\.\\d+)\"");

Matcher match = PATTERN.matcher(lineInFile);  
while(match.find()) {  
    //do something  
}

Update: I took some time on RegExr to come up with a much more specific pattern, which should only detect valid IP addresses as a bonus. I know it's ugly as hell, but my guess is that it's pretty efficient, as it eliminates most of the backtracking:

([A-Z]+)\s*\"((?:1[0-9]{2}|2(?:(?:5[0-5]|[0-9]{2})|[0-9]{1,2})\.)
{3}(?:1[0-9]{2}|2(?:5[0-5]|[0-9]{2})|[0-9]{1,2}))

(Wrapped for readability, all back-slashes need to be escaped in java, but you can test it on RegExr as it is with the OP's test string)

seanizer 2010-09-29 09:36:44

The effect of the first micro-optimization is probably too small to measure. The second one changes the meaning of the regex, and it is not clear that it helps much. However, precompiling and reusing the pattern is definitely worthwhile.

Stephen C 2010-09-29 10:47:01

Re the second micro-optimization: looking at the code of the Pattern class, I think this might actually slow pattern matching down!

Stephen C 2010-09-29 11:01:51

Yes, That makes sense. Finding the negation of a small character class should be more efficient than finding a lqrge character class like \w. I'll update my answer soon

seanizer 2010-09-29 11:27:10

I found out during my tests that \w+ is 15-20% faster than \S+.

splash 2010-09-29 11:49:51

Well maybe then I'll leave it in after all :-)

seanizer 2010-09-29 11:56:13

Answer 3

A:

You can improve your regex to: "(\\S{2})\\s+\"((\\d{1,3}\\.){3}\\d{1,3})\"" by specifying an IP address more explicitly.

Try out the performance of using a StringTokenizer. It does not use regular expressions. (If you are concerned about using a legacy class, then take a look at its source and see how it is done.)

StringTokenizer st = new StringTokenizer(lineInFile, " ,\"");
while(st.hasMoreTokens()){
    String key = st.nextToken();
    String ip = st.nextToken();
    System.out.println(key + " ip: " +  ip);
}

dogbane 2010-09-29 09:38:46

*StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.* (Source: http://download.oracle.com/javase/6/docs/api/java/util/StringTokenizer.html )

seanizer 2010-09-29 09:41:04

That said, Scanner may be a good alternative: http://download-llnw.oracle.com/javase/6/docs/api/java/util/Scanner.html

seanizer 2010-09-29 09:44:10

Yes, I know. That's why I put a note in my post. StringTokenizer uses String's `indexOf` and `substring` methods internally so we can see how it works and replicate its functionality in our new code if it is faster than regex.

dogbane 2010-09-29 09:46:43

`Scanner` uses regex.

dogbane 2010-09-29 09:48:00

*That's why I put a note in my post* I missed that. This must be my blind day, sorry.

seanizer 2010-09-29 10:27:59

Answer 4

A:

I don't know if this will yield a big performance benefit, but you could also first do

string.split(", ") // separate groups

and then

string.split(" ?\"") // separate alias from IP address

on the matches.

Tim Pietzcker 2010-09-29 10:00:01

So two regex passes will be faster than one? I doubt it.

seanizer 2010-09-29 10:25:55

@seanizer: I'm doubtful, too. I don't use Java, so I can't profile it. But it might be worth a try.

Tim Pietzcker 2010-09-29 12:14:29

Answer 5

A:

Precompiling and reusing the Pattern object is (IMO) likely to be the most effective optimization. Pattern compilation is potentially an expensive step.

Reusing the Matcher instance (e.g. using reset(CharSequence)) might help, but I doubt that it will make much difference.

The regex itself cannot be optimized significantly. One possible speedup would be to replace (\d+\.\d+\.\d+\.\d+) with ([0-9\.]+). This might help because it reduces the number of potential backtrack points ... but you'd need to do some experiments to be sure. And the obvious downside is that it matches character sequences that are not valid IP addresses.

Stephen C 2010-09-29 11:13:28

Reusing the matcher sounds like a bad idea, because it would really break things in a multithreaded scenario (unless you introduce object pools and *that* would really be overkill)

seanizer 2010-09-29 15:51:38

Answer 6

A:

If you`re noticing a difference of < 1 sec on that piece of code, then your input string must contain around a million (ot at least some 100k) of entries. I think that's a pretty fair performance and I cannot see how you could significantly optimize that without writing your own specialized parser.

splash 2010-09-29 11:42:13

ansaurus

tags:

views:

answers:

Java Regular Expressions using Pattern and Matcher

related questions