Hey guys, I have a regular expression that is pretty long and hard to look at. I was wondering if you could help shorten it so it's more manageable. I admit I'm not a regexp guru; I just hack away to get by. If you come up with something better (it doesn't even have to be shorter), please explain your reasoning so I can better understand the techniques you use.

Regex:

^([a-zA-Z0-9# ]+)-([a-zA-Z ]*)([a-zA-Z0-9_ ]+)-([a-zA-Z0-9_ ]+)-([a-zA-Z0-9_ ]+)-([a-zA-Z0-9_ ]+)-([a-zA-Z0-9_ ]+)-([a-zA-Z0-9_ ]+)-([a-zA-Z0-9_ ]+)-([a-zA-Z0-9_ ]+)-([a-zA-Z0-9_ ]+)-([a-zA-Z ~]+)([a-zA-Z0-9_ ]+)\.rpt$

Tests:

TESTFIX - ABCD 10118 - E008 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ 91.rpt
TESTFIX - EFGD 10118 - E008 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ 92.rpt
TESTFIX - 10118_14041 M - E008 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ 93.rpt
TESTFIX - ABCD 10118 - E008 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ 93.rpt
TESTFIX - EFGD 10118 - E008 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ 93.rpt
TESTFIX - EFGD 10118 - E008 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ 93.rpt
TESTFIX - ABCD 10118 - E008 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ 93.rpt
#1REALLYLONGNAME - 10244 - E011 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - DX ~ ALPHALTR.rpt
#1 LIVEREP - 10045 - E011 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ SING.rpt
#2 LIVEREP - 10045 M - E011 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ MUL.rpt
WELLREP - WELL10000 - E011 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ CLT.rpt

Each section is separated by the ' - ' sequence of characters. All sections can contain spaces and any valid file name character.

Each section has to be captured in its own group. If it matters, I'll be using this regexp in C#.

+5  A: 

First of all, get a good regular expression development tool. My favorite is Expresso.

Here is a cleaned up version:

^[\w# ]+ - [a-zA-Z ]*(?:[\w_ ]+ - ){9}[a-zA-Z]+ ~[\w_ ]+\.rpt$

Changes include:

  • Removed the capture groupings "()". I assumed you were only validating the text; if you need the groups, they're easy enough to add back.
  • Used the word-character class "\w", which covers "[a-zA-Z0-9_]" (in .NET it also matches other Unicode letters and digits unless RegexOptions.ECMAScript is set). The extra "_" in "[\w_ ]" is therefore redundant but harmless.
  • Replaced the repeated portion in the middle with "(?:[\w_ ]+ - ){9}". This matches ([word characters and spaces]+ - ) nine times. It doesn't capture because of the "?:" I put after the opening parenthesis.
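
To make the effect of the non-capturing group and the {9} quantifier concrete, here is a minimal C# sketch. It assumes .NET 5 or later (top-level statements); the variable names and the too-short "bad" file name are made up for illustration and are not from the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Validation-only pattern from above. (?: ... ){9} repeats the
// "section plus separator" piece nine times without creating a group.
var pattern = @"^[\w# ]+ - [a-zA-Z ]*(?:[\w_ ]+ - ){9}[a-zA-Z]+ ~[\w_ ]+\.rpt$";

// One of the asker's test names.
var good = "#1 LIVEREP - 10045 - E011 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ SING.rpt";
// Hypothetical name with too few sections, so the {9} repetition cannot be satisfied.
var bad = "#1 LIVEREP - 10045 - E011 - SX ~ SING.rpt";

bool goodMatches = Regex.IsMatch(good, pattern);
bool badMatches = Regex.IsMatch(bad, pattern);

// Group 0 (the whole match) always exists, so a pattern with no
// capturing groups reports exactly one group number.
int groupCount = new Regex(pattern).GetGroupNumbers().Length;

Console.WriteLine(goodMatches); // True
Console.WriteLine(badMatches);  // False
Console.WriteLine(groupCount);  // 1
```

Regex.IsMatch is enough when you only need a yes/no answer; switch to Regex.Match once you need the individual sections.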

EDIT:

Here it is with the capture groups back:

^([\w# ]+) - ([a-zA-Z ]*)(?:([\w_ ]+) - ){9}([a-zA-Z]+) ~ ([\w_ ]+)\.rpt$

Note that when you go through the numbered capture groups, the third one will have 9 captures in it.
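
A short sketch of reading those nine captures in C#, assuming .NET 5 or later (top-level statements). The variable names are illustrative; the pattern and the file name are copied from this thread.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var pattern = @"^([\w# ]+) - ([a-zA-Z ]*)(?:([\w_ ]+) - ){9}([a-zA-Z]+) ~ ([\w_ ]+)\.rpt$";
var name = "TESTFIX - ABCD 10118 - E008 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ 91.rpt";

Match m = Regex.Match(name, pattern);

// Group 3 sits inside the {9} repetition, so .NET records one Capture
// per iteration in Groups[3].Captures.
string[] middle = m.Groups[3].Captures
    .Cast<Capture>()
    .Select(c => c.Value)
    .ToArray();

Console.WriteLine(m.Groups[1].Value);          // TESTFIX
Console.WriteLine($"[{m.Groups[2].Value}]");   // [ABCD ] (the class includes the space)
Console.WriteLine(string.Join("|", middle));   // 10118|E008|E009|IXX|IXX|IXX|IXX|IXX|IXX
Console.WriteLine(m.Groups[3].Value);          // IXX (Value holds only the last capture)
Console.WriteLine(m.Groups[4].Value);          // SX
Console.WriteLine(m.Groups[5].Value);          // 91
```

Keep in mind that Groups[3].Value returns only the last of the nine captures, so iterate over Captures when you need every section. Most other regex flavors keep just that last value, which makes this approach specific to .NET.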

James Kolpack
I need the capture groups.
Michael G
Note that _ is also included in \w
rob
+3  A: 

You can replace every instance of a-zA-Z0-9_ with \w. Also, 0-9 can be slightly shortened to \d.

Here are the character classes supported by C#: http://msdn.microsoft.com/en-us/library/20bw873z%28VS.71%29.aspx
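
One caveat when swapping the explicit ranges for \w and \d in .NET: by default both shorthands are Unicode-aware, so they accept more than the ASCII ranges they replace. The sketch below (assuming .NET 5 or later for top-level statements, with illustrative variable names) shows the difference, and how RegexOptions.ECMAScript narrows them back to ASCII.

```csharp
using System;
using System.Text.RegularExpressions;

// \w includes the underscore, so [a-zA-Z0-9_] and \w agree on ASCII input.
bool underscoreIsWord = Regex.IsMatch("_", @"^\w$");

// By default \w also matches non-ASCII letters such as U+00E9 (e with acute)...
bool accentedIsWord = Regex.IsMatch("\u00E9", @"^\w$");
// ...but not when ECMAScript-compatible behavior is requested.
bool accentedIsWordEcma = Regex.IsMatch("\u00E9", @"^\w$", RegexOptions.ECMAScript);

// Likewise \d matches any Unicode decimal digit, e.g. U+0663 (Arabic-Indic three).
bool arabicIsDigit = Regex.IsMatch("\u0663", @"^\d$");
bool arabicIsDigitEcma = Regex.IsMatch("\u0663", @"^\d$", RegexOptions.ECMAScript);

Console.WriteLine(underscoreIsWord);   // True
Console.WriteLine(accentedIsWord);     // True
Console.WriteLine(accentedIsWordEcma); // False
Console.WriteLine(arabicIsDigit);      // True
Console.WriteLine(arabicIsDigitEcma);  // False
```

If the file names are guaranteed to be ASCII this makes no practical difference; otherwise keep the explicit ranges or pass RegexOptions.ECMAScript.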

You can make a group non-capturing by including ?: at the beginning after the opening parenthesis for the group. If you have an expression that repeats a known number of times, you can follow it with {n}:

^([a-zA-Z\d# ]+)-([a-zA-Z ]*)(?:([\w ]+)-){9}([a-zA-Z ~]+)([\w ]+)\.rpt$

rob
+4  A: 

When you are talking about "simplifying" Regular Expressions, you really need to also know what you don't want to match, as that can really help simplify your tests with special characters, sequence repetition, etc.

That said, here is a cleaned up version that produces exactly the same result as your original expression:

^([a-zA-Z0-9# ]+)-([a-zA-Z ]*)(?:([\w ]+)-){9}([a-zA-Z ~]+)([\w ]+)\.rpt$

Some notes on why this differs from the other posted answer:

  • According to my reference for Perl-compatible regular expressions, \w actually also includes underscore. (Edit: this is apparently different from C# which is explained in the link to MSDN. This difference may be useful to note.)
  • My expression assumes you had the spaces in the character classes on purpose. If, in fact, you can have multiple spaces between dashes, leave it this way; otherwise, go with the other answer.
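
Because the spaces live inside the character classes here rather than in the separators, the captured values keep their padding. The following sketch (assuming .NET 5 or later for top-level statements, with illustrative variable names) shows what the groups contain for two of the question's test names and how Trim() cleans them up.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var pattern = @"^([a-zA-Z0-9# ]+)-([a-zA-Z ]*)(?:([\w ]+)-){9}([a-zA-Z ~]+)([\w ]+)\.rpt$";

var numericTail = "TESTFIX - ABCD 10118 - E008 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ 91.rpt";
var alphaTail = "WELLREP - WELL10000 - E011 - E009 - IXX - IXX - IXX - IXX - IXX - IXX - SX ~ CLT.rpt";

Match m1 = Regex.Match(numericTail, pattern);
Match m2 = Regex.Match(alphaTail, pattern);

// The separator is a bare "-", so the surrounding spaces end up in the captures.
string rawFirst = m1.Groups[1].Value;
string[] sections = m1.Groups[3].Captures
    .Cast<Capture>()
    .Select(c => c.Value.Trim())
    .ToArray();

Console.WriteLine($"[{rawFirst}] -> [{rawFirst.Trim()}]");             // [TESTFIX ] -> [TESTFIX]
Console.WriteLine(string.Join("|", sections));                         // 10118|E008|E009|IXX|IXX|IXX|IXX|IXX|IXX
Console.WriteLine($"[{m1.Groups[4].Value}] [{m1.Groups[5].Value}]");   // [ SX ~ ] [91]

// With an alphabetic tail, the greedy [a-zA-Z ~]+ takes everything it can
// and leaves only one character for the final group.
Console.WriteLine($"[{m2.Groups[4].Value}] [{m2.Groups[5].Value}]");   // [ SX ~ CL] [T]
```

The last line suggests that the final two groups split cleanly only when the part after "~" starts with a character outside [a-zA-Z ~], such as a digit. The same tail appears in the original expression, so if names like "SX ~ CLT" matter, a literal " ~ " between the last two groups (as in the first answer) looks like the safer choice.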
Renesis
`\w` includes the underscore in C#, too (it's covered by the Unicode property `\p{Pc}`).
Alan Moore