ansaurus

Question

C# Regex Replace weird behavior with multiple captures and matching at the end of string?

Answer 1

A:

You have to tell the engine that the first two groups may be empty, but the last one may not. Make the first two lazy and require at least one digit in the last:

(\d{0,2}?)(\d{0,4}?)(\d{1,4})$

Matches your examples properly in my testing.
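A minimal sketch of how this could be checked. It assumes the three digit strings below stand in for the question's examples; they are taken from the formats mentioned elsewhere in this thread. The class name LazyGroupsDemo and the helper Format are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyGroupsDemo
{
    // Pattern from this answer: the first two groups are lazy and may stay empty,
    // the last group must capture at least one digit before the end of the string.
    public const string Pattern = @"(\d{0,2}?)(\d{0,4}?)(\d{1,4})$";

    // Hypothetical helper: applies the "( ) - " mask to a string of digits.
    public static string Format(string digits)
    {
        return Regex.Replace(digits, Pattern, "($1) $2-$3");
    }

    static void Main()
    {
        // Sample inputs are assumptions, not the original question's data.
        foreach (string input in new[] { "5135554444", "35554444", "5554444" })
        {
            Console.WriteLine(input + " -> " + Format(input));
        }
        // 5135554444 -> (51) 3555-4444
        // 35554444 -> () 3555-4444
        // 5554444 -> () 555-4444
    }
}
```

The lazy groups only take digits when the third group cannot reach the end of the string on its own, which is why the leftmost group stays empty for the shorter inputs.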

cnanney 2010-08-07 04:05:01

Wow... It does work... and I thought that when we specify {0,4} the zero would already mean "OK if not there"...

Marcio Gabe 2010-08-07 04:12:27

If this did it for you, please mark as the accepted answer.

cnanney 2010-08-07 04:28:39

Answer 2

A:

str = Regex.Replace(str, @"(\d{2})?(\d{4})(\d{4})$", "($1) $2-$3");

You don't want 0-4 or 1-4 digits in $2 and $3; both have a fixed length of four digits. $1 is optional. Change the regex as shown above.

Edit: I didn't notice that you can have 3 digits in the second part. The new regex would be as follows (Just as you've correctly mentioned already).

str = Regex.Replace(str, @"(\d{2})?(\d{3,4})(\d{4})$", "($1) $2-$3");
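A small sketch of how the optional first group behaves, assuming the same sample digit strings used in the comments of this thread. The names OptionalGroupDemo, Format and HasAreaCode are hypothetical. When an optional group does not take part in the match, $1 is replaced with an empty string, so the parentheses stay empty.

```csharp
using System;
using System.Text.RegularExpressions;

static class OptionalGroupDemo
{
    // The edited pattern from this answer: optional 2-digit prefix,
    // 3 or 4 digits in the middle, exactly 4 digits at the end.
    public const string Pattern = @"(\d{2})?(\d{3,4})(\d{4})$";

    public static string Format(string digits)
    {
        return Regex.Replace(digits, Pattern, "($1) $2-$3");
    }

    // Reports whether the optional first group took part in the match.
    public static bool HasAreaCode(string digits)
    {
        return Regex.Match(digits, Pattern).Groups[1].Success;
    }

    static void Main()
    {
        Console.WriteLine(Format("5135554444"));      // (51) 3555-4444
        Console.WriteLine(Format("35554444"));        // () 3555-4444
        Console.WriteLine(Format("5554444"));         // () 555-4444
        Console.WriteLine(HasAreaCode("5135554444")); // True
        Console.WriteLine(HasAreaCode("35554444"));   // False
    }
}
```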

Hasan Khan 2010-08-07 04:10:17

For the phone problem I'll probably use your solution with one minor change: @"(\d{2})?(\d{3,4})(\d{4})$", since I'll accept 3 or 4 digits for the second group, so these are all valid: "(51) 3555-4444", "3555-4444" and "555-4444". As for the question, I'm really trying to better understand these "?" question marks, greedy vs. lazy matches and all...

Marcio Gabe 2010-08-07 04:25:57

Answer 3

+1 A:

So you want the third part to always have four digits, the second part zero to four digits, and the first part zero to two digits, but only if the second part contains four digits?

Use

^(\d{0,2}?)(\d{0,4})(\d{4})$

As a C# snippet, commented:

resultString = Regex.Replace(subjectString, 
  @"^             # anchor the search at the start of the string
    (\d{0,2}?)    # match as few digits as possible, maximum 2
    (\d{0,4})     # match up to four digits, as many as possible
    (\d{4})       # match exactly four digits
    $             # anchor the search at the end of the string", 
   "($1) $2-$3", RegexOptions.IgnorePatternWhitespace);

By adding a ? to a quantifier (??, *?, +?, {a,b}?) you make it lazy, i.e. you tell it to match as few characters as possible while still allowing an overall match to be found.

Without the ? in the first group, what would happen when trying to match 123456?

First, the \d{0,2} matches 12.

Then, the \d{0,4} matches 3456.

Then, the \d{4} doesn't have anything left to match, so the regex engine backtracks until that's possible again. After four steps, the \d{4} can match 3456. The \d{0,4} gives up everything it had matched greedily for this.

Now an overall match has been found, so there is no need to try any more combinations. Therefore the first group contains 12, the second group is empty, and the third group contains 3456.
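The walk-through above can be reproduced with a short sketch that compares the greedy and lazy variants of the first group on 123456. The class name GreedyVsLazyDemo and the helper Captures are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVsLazyDemo
{
    // Same pattern with and without the ? after the first quantifier.
    public const string GreedyFirst = @"^(\d{0,2})(\d{0,4})(\d{4})$";
    public const string LazyFirst = @"^(\d{0,2}?)(\d{0,4})(\d{4})$";

    // Returns the three captures joined with '|', or "(no match)".
    public static string Captures(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success)
        {
            return "(no match)";
        }
        return m.Groups[1].Value + "|" + m.Groups[2].Value + "|" + m.Groups[3].Value;
    }

    static void Main()
    {
        // Greedy first group keeps 12; the second group gives everything back.
        Console.WriteLine(Captures(GreedyFirst, "123456")); // 12||3456
        // Lazy first group stays empty; the second group keeps what is left over.
        Console.WriteLine(Captures(LazyFirst, "123456"));   // |12|3456
    }
}
```

With the replacement string "($1) $2-$3", the greedy variant would yield "(12) -3456" and the lazy variant "() 12-3456".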

Tim Pietzcker 2010-08-07 04:20:36

The second matching group needs the '?' so it expands as needed (lazy), otherwise 54444 won't match. Tested in RegexBuddy.

cnanney 2010-08-07 04:33:35

Well... You are right on the money with your regex... still, what I'm looking for is some rational explanation of why my first expression didn't work as expected. It seems that I'm stumbling over the use of the "?" question mark versus the zero portion in the {0,4} thing...

Marcio Gabe 2010-08-07 04:34:22

@cnanney: It seems I need to play with this RegexBuddy to learn some more :)

Marcio Gabe 2010-08-07 04:36:48

@cnanney: The second group doesn't need a `?`. Why shouldn't it match `54444` (also tested in RegexBuddy, by the way)?
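A quick way to check this claim, sketched with made-up names (SecondGroupDemo, Format). The greedy second group first grabs up to four digits, then backtracks one digit at a time until the third group can take its four.

```csharp
using System;
using System.Text.RegularExpressions;

static class SecondGroupDemo
{
    // Pattern from the answer, with a greedy (not lazy) second group.
    public const string Pattern = @"^(\d{0,2}?)(\d{0,4})(\d{4})$";

    public static string Format(string digits)
    {
        return Regex.Replace(digits, Pattern, "($1) $2-$3");
    }

    static void Main()
    {
        Console.WriteLine(Format("54444"));   // () 5-4444
        Console.WriteLine(Format("3554444")); // () 355-4444
    }
}
```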

Tim Pietzcker 2010-08-07 05:01:03

@Tim: I didn't put the "?" in the second group, and it works just as expected. I really like your commented C# format. The one change I made, for testing, was not to anchor the search at the start. This way, if there are more digits, they overflow to the left like this: "1(51) 3555-4444". Good stuff... :) Again, I'm not looking into actual validation of the phone numbers themselves... just trying to force the numbers into this kind of "( ) - " mask and learn some more about regex. Thanks!
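A rough sketch of that difference, assuming an 11-digit input chosen to produce the string quoted above. The names AnchorDemo, Anchored and Unanchored are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorDemo
{
    // Tim's pattern with and without the leading ^ anchor.
    public const string Anchored = @"^(\d{0,2}?)(\d{0,4})(\d{4})$";
    public const string Unanchored = @"(\d{0,2}?)(\d{0,4})(\d{4})$";

    public static string Format(string pattern, string digits)
    {
        return Regex.Replace(digits, pattern, "($1) $2-$3");
    }

    static void Main()
    {
        // Without ^ the match may start later, so the extra digit stays in front.
        Console.WriteLine(Format(Unanchored, "15135554444")); // 1(51) 3555-4444
        // With ^ the three groups cannot cover 11 digits, so nothing is replaced.
        Console.WriteLine(Format(Anchored, "15135554444"));   // 15135554444
    }
}
```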

Marcio Gabe 2010-08-07 05:08:31

WOW!!!!!!!!!!!!!!! You are the man!! :D... Your edit a while ago, just explained to me the whole thing in a way that really makes sense. (This behavior of first group having content while second didn't and third had it as well is what was puzzling to me) Now the whole start matching, and backtracking (probably because of the anchor at the end) is making perfect sense to me! This explains it all!

Marcio Gabe 2010-08-07 05:13:38

Using your regex that was edited 8 min ago, it would not match '3554444'. Without the ? in the second group, it will match as many digits as possible (4) and leave only 3 digits for the third matching group, which you've said requires 4, thus the whole match will fail. BTW, how do you highlight code inside comments? Wrapping in code tags didn't work.

cnanney 2010-08-07 05:23:24

http://i.imgur.com/jg1Wb.png

cnanney 2010-08-07 05:32:22

Ha... OK, so never mind. You are right about not needing the '?'. I have to admit, I wasn't entirely sure why RegexBuddy was saying yours didn't work, so I tried to reason it out the best I could (which was wrong). You have to believe RegexBuddy, right? It wouldn't lie to me! Actually it did: I had an older version that must have had some bug in it. I just upgraded and you, sir, are indeed correct. +1 for you.

cnanney 2010-08-07 05:56:55

Before you posted your comment about the second "?" mark, I was wondering, because I was testing it in the actual Visual Studio, and it worked without needing the second "?"... so... this explains it. Well... thank you all!... can I vote you up some more? :)

Marcio Gabe 2010-08-07 06:25:35
