I've got a working regular expression, but I'd like to make it a tad more readable, and I'm far from a regex guru, so I was humbly hoping for some tips.

This is designed to scrape the output of several different compilers, linkers, and other build tools, and is used to build a nice little summary report. It does its job great, but I'm left feeling like I wrote it in a clunky fashion, and I'd sooner learn than keep it the wrong way.

(.*?)\s?:?\s?(informational|warning|error|fatal error)?\s([A-Z]+[0-9][0-9][0-9][0-9]):\s(.*)$

Which, broken down simply, is as follows:

(.*?)                                       # non-greedily match up until...
\s?:?\s?                                    # we come across a possible " : "
(informational|warning|error|fatal error)?  # possibly followed by one of these
\s([A-Z]+[0-9][0-9][0-9][0-9]):\s           # but 100% followed by this alphanum
(.*)$                                       # and then capture the rest

I'm mostly interested in making the 2nd and 4th entry above more... beautiful. For some reason, the regex tester I was using (The Regulator) didn't match plain spaces, so I had to use the \s... but it is not meant to match any other whitespace.

Any schooling will be greatly appreciated.

+1  A: 

The easiest way to make a long regex more readable is to use the "free-spacing" (or /x) modifier, which lets you write your regex just like you did in the second block of code: whitespace in the pattern is ignored and # starts a comment. This isn't supported by all engines, however (according to the page linked above, .NET, Java, Perl, PCRE, Python, Ruby and XPath support it).

Note also that in free-spacing mode, you can use [ ] instead of \s if you want to only match a space character (unless you're using Java, in which case you have to use "\ ", a backslash followed by a space, which is an escaped space).
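
As a rough sketch of what free-spacing mode looks like in practice, here is the question's pattern written with Python's re.VERBOSE flag, using [ ] for the literal spaces. The input line is a made-up example, not real tool output, and the name DIAGNOSTIC is only for illustration.

```python
import re

# Free-spacing version of the question's pattern. With re.VERBOSE,
# unescaped whitespace in the pattern is ignored and "#" starts a comment,
# so a literal space is written as [ ].
DIAGNOSTIC = re.compile(r"""
    (.*?)                                         # preamble, non-greedy
    [ ]?:?[ ]?                                    # optional separator
    (informational|warning|error|fatal[ ]error)?  # optional severity
    [ ]([A-Z]+[0-9]{4}):[ ]                       # code such as C4996
    (.*)$                                         # the message text
""", re.VERBOSE)

# Hypothetical input line, invented for this example.
line = "main.c(12) : warning C4996: 'strcpy' is unsafe"
m = DIAGNOSTIC.search(line)
print(m.groups())
```

Note that the space inside "fatal error" also has to become [ ], otherwise it would be ignored along with the layout whitespace.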

There's not really anything you can do for the second line, if you want each element to be optional independently of the other elements, but the fourth can be shortened:

\s([A-Z]+\d{4}):\s

\d is a shorthand class equivalent to [0-9], and {4} specifies that it should appear exactly four times.

The third line can be slightly shortened as well ((?:…) specifies a non-capturing group):

(informational|warning|(?:fatal )?error)?

From an efficiency standpoint, unless you actually need to capture a subpattern, you can remove the parentheses around it. The exception is the third line, where the group is needed for the alternation, but that one can be made non-capturing. Putting this all together you'd get:

.*?
\s?:?\s?
(?:informational|warning|(?:fatal )?error)?
\s[A-Z]+\d{4}:\s
.*$
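
To illustrate the trade-off, here is a small Python sketch that joins the pieces above into one line and uses the result purely as a filter. The sample lines are invented for the example, and NO_CAPTURE is a hypothetical name.

```python
import re

# The put-together pattern from above, joined into a single line.
# Every group is non-capturing, so it can only answer "does this line match?".
NO_CAPTURE = re.compile(
    r".*?\s?:?\s?(?:informational|warning|(?:fatal )?error)?\s[A-Z]+\d{4}:\s.*$"
)

# Hypothetical build-output lines, invented for this example.
lines = [
    "foo.cpp(7) : fatal error C1083: Cannot open include file",
    "LINK : warning LNK4098: defaultlib conflicts",
    "Compiling foo.cpp...",
]
matched = [bool(NO_CAPTURE.search(line)) for line in lines]
print(NO_CAPTURE.groups, matched)
```

If the report needs the file name, severity, code, and message as separate fields, the capturing parentheses from the original have to stay.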
Daniel Vandersluis
I'm on board with /x... although I didn't use it here (having just found out about /x oh... yesterday :). I'm more interested in whether there is actually better regex syntax than what I used for lines 2 and 4.
Nate
+1  A: 

Line 2

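I think your regular expression doesn't match its comment: \s?:?\s? makes each of the three pieces optional on its own, so it also matches a lone space or a lone colon. You probably want this instead: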

(\s:\s)?

To make it non-capturing:

(?:\s:\s)?

You should be able to use a literal space instead of \s. If you can't, that must be a restriction in the tool you are using.
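
The difference between the two forms can be checked with a quick Python sketch. The names loose and strict are just labels for this example.

```python
import re

loose = re.compile(r"\s?:?\s?")     # each piece is optional on its own
strict = re.compile(r"(?:\s:\s)?")  # " : " as a unit, or nothing at all

# For each candidate separator, record (loose matches, strict matches).
results = {
    text: (bool(loose.fullmatch(text)), bool(strict.fullmatch(text)))
    for text in [" : ", ":", " ", ""]
}
for text, outcome in results.items():
    print(repr(text), outcome)
```

Both accept " : " and the empty string, but only the loose form accepts a bare colon or a bare space.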

Line 4

[0-9][0-9][0-9][0-9] can be replaced with [0-9]{4}.

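In many regex flavors \d is a shorthand for [0-9], though in some (for example .NET, or Python 3 with str patterns) \d also matches non-ASCII decimal digits.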
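
As one illustration of where the two can differ, here is a short Python 3 sketch. It assumes str (not bytes) patterns, where \d is Unicode-aware unless the re.ASCII flag is given.

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

unicode_d = bool(re.fullmatch(r"\d", arabic_three))          # True
ascii_class = bool(re.fullmatch(r"[0-9]", arabic_three))     # False
ascii_d = bool(re.fullmatch(r"\d", arabic_three, re.ASCII))  # False
print(unicode_d, ascii_class, ascii_d)
```

For compiler codes like C4996 this rarely matters, but [0-9] states the intent unambiguously.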

Mark Byers
I like the non-capturing bit... on a side note, is it possible to mandate one of those two optional groups? i.e., one or the other or both, but not neither?
Nate
@Nate: I think this is about the best way to do that: `((informational|warning|error|fatal error)(\s:\s)?|\s:\s)`
Mark Byers
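
A quick Python check of the alternation from the comment above, taken as written (keyword first, then the optional " : "). AT_LEAST_ONE is a hypothetical name used only here.

```python
import re

# Either a keyword with an optional " : " after it, or " : " on its own.
AT_LEAST_ONE = re.compile(
    r"((informational|warning|error|fatal error)(\s:\s)?|\s:\s)"
)

accepted = {
    text: bool(AT_LEAST_ONE.fullmatch(text))
    for text in ["warning", "warning : ", " : ", ""]
}
for text, ok in accepted.items():
    print(repr(text), ok)
```

The empty string is the only candidate rejected, which is the "not neither" behaviour asked about.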
A: 

Perhaps you can build the RE from sub-expressions, so that your end RE would look something like this:

 /$preamble$possible_colon$keyword$alphanum$trailer/
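
A minimal Python sketch of the same idea, reusing the names from the line above and the pieces from the question. The input line is invented for the example.

```python
import re

# One named piece per line of the question's breakdown.
preamble = r"(.*?)"
possible_colon = r"\s?:?\s?"
keyword = r"(informational|warning|error|fatal error)?"
alphanum = r"\s([A-Z]+[0-9]{4}):\s"
trailer = r"(.*)$"

DIAGNOSTIC = re.compile(preamble + possible_colon + keyword + alphanum + trailer)

# Hypothetical input line, invented for this example.
m = DIAGNOSTIC.search("lib.obj : error LNK2019: unresolved external symbol")
print(m.groups())
```

Each piece can then be tested or swapped out on its own, without touching the rest of the expression.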
zigdon