I have a very uniform set of data from RADIUS messages that I need to add to our log management solution. The product lets me extract the individual fields with regular expressions in one of two ways:

1) Individual regular expressions for each piece of data you wish to pull out

    <data 1 = regex statement>
    <data 2 = different regex statement>    
    <data 3 = yet another regex statement>

2) A singular regular expression using capture groups

    <group = regex statement with capture groups>
        <data 1 = capture group[X]>
        <data 2 = capture group[Y]>
        <data 3 = capture group[Z]>
    </group>

Here is a sample message:

    <158>Jul 6 14:33:00 radius/10.10.100.12 radius: 07/06/2010 14:33:00 AP1A-BLAH (10.10.10.10) - 6191 / Wireless - IEEE 802.11: abc1234 - Access-Accept (AP: 000102030405 / SSID: bork / Client: 050403020100)

I want to pull out several bits of data, all of them between spaces. Something along the lines of the following doesn't seem efficient:

    (.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s(.*?)\s

So, given the data above, what's the most efficient Java regex that will grab each space-delimited field and put it into its own capture group?

+2  A: 

You could be more specific:

    (\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s(\S*)\s

\S matches a non-space character - this makes the regex more efficient by avoiding backtracking, and it allows the regex to fail faster if the input doesn't fit the pattern.

For example, when your regex is applied to the string `Jul 6 14:33:00 radius/10.10.100.12 radius: 07/06/2010`, the regex engine needs 2116 steps to find out that it can't match. The regex above fails in 168 steps.

Alan Moore's suggestion to use `(\S*+)\s(\S*+)\s(\S*+)\s(\S*+)\s(\S*+)\s(\S*+)\s` brings another improvement: now the regex fails within 24 steps (roughly 90 times fewer than the initial regex).
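To make the difference between a greedy and a possessive quantifier concrete, here is a minimal sketch. The class and method names are invented for illustration; only `java.util.regex.Pattern` is a real API. It shows that `\S*+` never gives characters back, which is harmless in the log pattern because the `\s` that follows can never match anything `\S*+` has consumed.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class PossessiveDemo {
    // Greedy: \S* first takes "abc", then backtracks one character
    // so that the literal "c" can still match.
    static boolean greedyMatches(String s) {
        return Pattern.compile("\\S*c").matcher(s).matches();
    }

    // Possessive: \S*+ takes "abc" and refuses to give anything back,
    // so the literal "c" has nothing left to match.
    static boolean possessiveMatches(String s) {
        return Pattern.compile("\\S*+c").matcher(s).matches();
    }

    // In the log pattern the possessive group is followed by \s.
    // \S and \s are disjoint, so no valid match is lost.
    static boolean fieldMatches(String s) {
        return Pattern.compile("(\\S*+)\\s").matcher(s).matches();
    }
}
```

With the input `abc`, the greedy version matches and the possessive one does not; with `abc ` (trailing space), the field pattern matches as expected.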

If the match is successful, Alan's solution and mine perform the same; your regex is about ten times slower.
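For reference, this is roughly how the possessive pattern could be exercised from plain Java, outside the log management product. It is only a sketch: `RadiusFields` and `firstSixFields` are made-up names, and the product's own regex interface may behave differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of any product API.
class RadiusFields {
    // Six fields, each followed by one whitespace character. The possessive
    // \S*+ never gives characters back, so a failed attempt is abandoned quickly.
    private static final Pattern FIELDS = Pattern.compile(
            "(\\S*+)\\s(\\S*+)\\s(\\S*+)\\s(\\S*+)\\s(\\S*+)\\s(\\S*+)\\s");

    // Returns the six captured fields, or null if the line does not
    // contain six whitespace-terminated fields.
    static String[] firstSixFields(String line) {
        Matcher m = FIELDS.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // group 0 is the whole match
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened version of the sample message from the question.
        String line = "<158>Jul 6 14:33:00 radius/10.10.100.12 radius: 07/06/2010 14:33:00 AP1A-BLAH";
        for (String field : firstSixFields(line)) {
            System.out.println(field);
        }
    }
}
```

The pattern is compiled once into a static constant, which matters when the same expression is applied to many log lines.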

Tim Pietzcker
You can take it a step further and make all quantifiers possessive, i.e., `(\S*+)`. You can't get much more efficient than that.
Alan Moore
+1  A: 

I just thought of something else - why not simply split the string on whitespace?

    String[] splitArray = subjectString.split("\\s");
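If only the leading fields are needed, the two-argument form of `String.split` can stop early. The following is a sketch under that assumption; `SplitFields` and `firstSixAndRest` are hypothetical names, while `split(String regex, int limit)` is the standard library method.

```java
// Hypothetical helper; the names are illustrative only.
class SplitFields {
    // A limit of 7 yields at most seven elements: the first six
    // whitespace-delimited fields plus the unsplit remainder of the line.
    static String[] firstSixAndRest(String line) {
        return line.split("\\s", 7);
    }
}
```

The last array element then holds everything after the sixth field, which can be ignored or parsed separately.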
Tim Pietzcker
The only reason I can't do this is that the interface I'm offered accepts regex only and doesn't include the ability to do fun things like that. The first answer is perfect, though: the regex should be efficient enough when processing massive amounts of logs per second.
Chris