Is it possible to skip a couple of characters in a capture group in regular expressions? I am using .NET regexes but that shouldn't matter.

Basically, what I am looking for is:

[random text]AB-123[random text]

and I need to capture 'AB123', without the hyphen.

I know that AB is 2 or 3 uppercase characters and 123 is 2 or 3 digits, but that's not the hard part. The hard part (at least for me) is skipping the hyphen.

I guess I could capture both separately and then concatenate them in code, but I wish I had a more elegant, regex-only solution.

Any suggestions?

+7  A: 

In short: you can't. A match is always contiguous. Even when the pattern contains things such as zero-width assertions, there is no way around matching the next character if you want to get to the one after it.
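To illustrate the point, here is a minimal sketch. The input string and the class name are made up for the example. It shows that a group's value is just an index and a length into the input, so a single group that spans both parts has to include the hyphen.

```csharp
using System;
using System.Text.RegularExpressions;

class ContiguousCaptureDemo
{
    static void Main()
    {
        // Hypothetical sample input, not taken from the question.
        string text = "foo AB-123 bar";

        // One group spanning both parts necessarily includes the hyphen.
        Match match = Regex.Match(text, "([A-Z]{2,3}-[0-9]{2,3})");
        Group group = match.Groups[1];

        // A capture is an (Index, Length) pair into the input,
        // so its value is always a contiguous substring.
        Console.WriteLine(group.Value); // AB-123
        Console.WriteLine(group.Value == text.Substring(group.Index, group.Length)); // True
    }
}
```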

Tomalak
+6  A: 

There really isn't a way to write an expression whose matched text differs from what is found in the source text. You will need to remove the hyphen in a separate step, either by matching the first and second parts individually and concatenating the two groups:

// Group 1: 2-3 uppercase letters; group 2: 2-3 digits.
Match match = Regex.Match( text, "([A-Z]{2,3})-([0-9]{2,3})" );
string matchedText = string.Format( "{0}{1}",
    match.Groups[1].Value,
    match.Groups[2].Value );

Or by removing the hyphen in a step separate from the matching process:

match = Regex.Match( text, "[A-Z]{2,3}-[0-9]{2,3}" );
matchedText = match.Value.Replace( "-", "" );
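A variation on the first approach, sketched here as an assumption about what reads best rather than as the only way: .NET's Match.Result expands a replacement pattern against a match, so "$1$2" joins the two groups without a format string. It is still a concatenation of two captures, not a single capture. The sample input and class name are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

class JoinGroupsDemo
{
    static void Main()
    {
        // Hypothetical sample input, not taken from the question.
        string text = "xx AB-123 yy";

        Match match = Regex.Match(text, "([A-Z]{2,3})-([0-9]{2,3})");
        if (match.Success)
        {
            // "$1$2" is a replacement pattern: group 1 followed by group 2.
            string matchedText = match.Result("$1$2");
            Console.WriteLine(matchedText); // AB123
        }
    }
}
```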
Jeff Hillman
+1  A: 

Your assertion that it's not possible to do this without sub-grouping and concatenating is correct.

You could also do as Jeff Hillman suggests and simply strip out the unwanted character(s) after the fact.

It is important to note here, though: don't use regex for everything.

Regex is designed to give less complicated solutions to non-trivial problems. You shouldn't reach for "oh, we'll use a regex" every time, and you shouldn't get into the habit of thinking you can solve every problem with a one-step regex.

When there is a viable trivial method that works, by all means, use it.

An alternative idea, if you need to handle multiple matches in a body of text, is to look for your language's callback-based regex replace, which passes each match to a function that can compute the substitution in-line. (This is especially handy when doing regex replaces.)

Not sure how it would work in .NET, but in PHP you would do something like this (not exact code):

  // The callback receives the array of matches; $m[0] is the full match.
  function strip_reverse( $m )
  {
     $a = str_replace( "-", "", $m[0] );
     return strrev( $a );
  }
  $b = preg_replace_callback( "/AB-?cde/", 'strip_reverse', "Hello World AB-cde" );
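For .NET, a rough equivalent of the callback approach might look like the sketch below. Regex.Replace has an overload that takes a MatchEvaluator delegate, which is called once per match and returns the replacement text. The pattern is the one from the question; the input string and the names StripHyphen and CallbackReplaceDemo are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class CallbackReplaceDemo
{
    // Called once per match; returns the text that replaces it.
    static string StripHyphen(Match m)
    {
        return m.Groups[1].Value + m.Groups[2].Value;
    }

    static void Main()
    {
        // Hypothetical sample input with two matches.
        string text = "first AB-123, then XYZ-45";

        string result = Regex.Replace(
            text,
            "([A-Z]{2,3})-([0-9]{2,3})",
            new MatchEvaluator(StripHyphen));

        Console.WriteLine(result); // first AB123, then XYZ45
    }
}
```

This rewrites every occurrence in place, which suits the multiple-match case; for a single value, the two-step approaches above are simpler.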
Kent Fredric
It is a common misunderstanding that regex is for "less complicated situations" only. Regex is immensely powerful and can solve really complex stuff. Regex is just not the right tool for things that are not regular. It's simple: there are things that work with regex, and there are those that don't.
Tomalak
Yes, but there's a prolific /overuse/ of regex in situations where the solution amounts to using a firearm to hole-punch paper. It'll work, but there are complications that don't exist in the simpler solution. The key is knowing when *not* to use regex ;)
Kent Fredric
Knowing when to use which tool is always the key. I would probably avoid using regex in a long loop when there was another way (say, "indexOf" plus a little math).
Tomalak
For those cases there is the "study regex" optimisation which makes a memory tree to boost regex matching ;)
Kent Fredric