I'm trying to write something that formats Brazilian phone numbers, but I want it to match from the end of the string rather than the beginning, so it would transform input strings according to the following pattern:

"5135554444" -> "(51) 3555-4444"
"35554444" -> "3555-4444"
"5554444" -> "555-4444"

Since the beginning portion is what usually changes, I thought of building the match using the $ sign so it would start at the end and then capture backwards (or so I thought), replacing the match with the desired format, and afterwards just getting rid of the parentheses "()" in front if they were empty.

This is the C# code:

string s = "5135554444";
string str = Regex.Replace(s, @"\D", ""); //Get rid of non digits, if any
str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{1,4})$", "($1) $2-$3");
return Regex.Replace(str, @"^\(\) ", ""); //Get rid of empty () at the beginning

The return value was as expected for a 10-digit number. But for anything shorter than that, it showed some strange behavior. These were my results:

"5135554444" -> "(51) 3555-4444"
"35554444" -> "(35) 5544-44"
"5554444" -> "(55) 5444-4"

It seems that it ignores the $ at the end to do the match, except that if I test with something less than 7 digits it goes like this:

"554444" -> "(55) 444-4"
"54444" -> "(54) 44-4"
"4444" -> "(44) 4-4"

Notice that the third capture group always takes only its minimum number of digits (one, for {1,4}) from the end, while the first two groups capture from the beginning, as if the last group were non-greedy from the end and just took the minimum... Is this weird, or is it just me?

Now, if I change the pattern, so instead of {1,4} on the third capture I use {4} these are the results:

str = Regex.Replace(str, @"(\d{0,2})(\d{0,4})(\d{4})$", "($1) $2-$3");

"5135554444" -> "(51) 3555-4444" //As expected
"35554444" -> "(35) 55-4444" //The last four are as expected, but "35" as $1?
"54444" -> "(5) -4444" //Again "4444" in $3, why nothing in $2 and "5" in $1?

I know this is probably some stupidity of mine, but wouldn't it be more reasonable if I want to capture at the end of the string, that all previous capture groups would be captured in reverse order?

I would think that "54444" would turn into "5-4444" in this last example... but it does not...

How would one accomplish this?

(I know there may be a better way to accomplish the very same thing using a different approach... but what I'm really curious about is why this particular Regex behavior seems odd. So, the answer to this question should focus on explaining why the last capture is anchored at the end of the string and why the others are not, as demonstrated in this example. I'm not particularly interested in the actual phone number formatting problem, but in understanding the Regex syntax)...

Thanks...

A: 

You have to tell it that it's OK if the first matching groups aren't there, but not the last one:

(\d{0,2}?)(\d{0,4}?)(\d{1,4})$

Matches your examples properly in my testing.
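To make the effect concrete, here is a minimal sketch that plugs the lazy pattern into the three steps from the question. The class and method names (PhoneFormat, Format) are hypothetical, chosen only for illustration; the three Regex.Replace calls are the ones from the question, with only the pattern swapped.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class PhoneFormat
{
    public static string Format(string input)
    {
        // Step 1: strip anything that is not a digit.
        string digits = Regex.Replace(input, @"\D", "");

        // Step 2: the lazy groups 1 and 2 start empty and grow only as far
        // as needed for the greedy third group to reach the end anchor.
        string masked = Regex.Replace(
            digits, @"(\d{0,2}?)(\d{0,4}?)(\d{1,4})$", "($1) $2-$3");

        // Step 3: drop the empty "() " prefix left when group 1 captured nothing.
        return Regex.Replace(masked, @"^\(\) ", "");
    }
}
```

With this sketch, PhoneFormat.Format("5135554444") gives "(51) 3555-4444", "35554444" gives "3555-4444", and "5554444" gives "555-4444", which are the three results the question asks for.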

cnanney
Wow... It does work... and I thought that when we specify {0,4} the zero would already mean "OK if not there"...
Marcio Gabe
If this did it for you, please mark as the accepted answer.
cnanney
A: 
str = Regex.Replace(str, @"(\d{2})?(\d{4})(\d{4})$", "($1) $2-$3");

You don't want 0-4 or 1-4 characters in $2 and $3. They both are of fixed 4 digit length. $1 is optional. Change the regex as mentioned above.

Edit: I didn't notice that you can have 3 digits in the second part. The new regex would be as follows (Just as you've correctly mentioned already).

str = Regex.Replace(str, @"(\d{2})?(\d{3,4})(\d{4})$", "($1) $2-$3");
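Because the first group is now optional as a whole, it is also possible to check whether it took part in the match instead of stripping an empty "() " afterwards. The following is only a sketch of that idea, assuming a MatchEvaluator is acceptable in place of the replacement string; OptionalAreaCode and Format are hypothetical names, and the input is assumed to already contain digits only.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class OptionalAreaCode
{
    public static string Format(string digits)
    {
        return Regex.Replace(digits, @"(\d{2})?(\d{3,4})(\d{4})$", m =>
        {
            // The local part is always present when the pattern matches.
            string local = m.Groups[2].Value + "-" + m.Groups[3].Value;

            // Group.Success is false when the optional group did not participate.
            return m.Groups[1].Success
                ? "(" + m.Groups[1].Value + ") " + local
                : local;
        });
    }
}
```

Here OptionalAreaCode.Format("5135554444") gives "(51) 3555-4444", while "35554444" and "5554444" give "3555-4444" and "555-4444" with no parentheses to clean up.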
Hasan Khan
For the phone problem I'll probably use your solution with one minor change: @"(\d{2})?(\d{3,4})(\d{4})$", since I'll accept 4 or 3 digits for the second group, so these are all valid: "(51) 3555-4444", "3555-4444" and "555-4444". As for the question, I'm really trying to better understand these "?" question marks, greedy vs. lazy matches and all...
Marcio Gabe
+1  A: 

So you want the third part to always have four digits, the second part zero to four digits, and the first part zero to two digits, but only if the second part contains four digits?

Use

^(\d{0,2}?)(\d{0,4})(\d{4})$

As a C# snippet, commented:

resultString = Regex.Replace(subjectString, 
  @"^             # anchor the search at the start of the string
    (\d{0,2}?)    # match as few digits as possible, maximum 2
    (\d{0,4})     # match up to four digits, as many as possible
    (\d{4})       # match exactly four digits
    $             # anchor the search at the end of the string", 
   "($1) $2-$3", RegexOptions.IgnorePatternWhitespace);

By adding a ? to a quantifier (??, *?, +?, {a,b}?) you make it lazy, i.e. you tell it to match as few characters as possible while still allowing an overall match to be found.

Without the ? in the first group, what would happen when trying to match 123456?

First, the \d{0,2} matches 12.

Then, the \d{0,4} matches 3456.

Then, the \d{4} doesn't have anything left to match, so the regex engine backtracks until that's possible again. After four steps, the \d{4} can match 3456. The \d{0,4} gives up everything it had matched greedily for this.

Now, an overall match has been found - no need to try any more combinations. Therefore, the first and third groups will contain parts of the match, while the second group ends up empty.
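The walkthrough for 123456 can be checked by looking at the captured groups directly. This is a small sketch, not a definitive tool: GroupDemo and Split are hypothetical names, and the method simply joins the three groups with '|' so the split is visible.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
public static class GroupDemo
{
    // Returns groups 1 to 3 joined by '|', or "(no match)" if the pattern fails.
    public static string Split(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success)
        {
            return "(no match)";
        }
        return m.Groups[1].Value + "|" + m.Groups[2].Value + "|" + m.Groups[3].Value;
    }
}
```

GroupDemo.Split(@"^(\d{0,2})(\d{0,4})(\d{4})$", "123456") returns "12||3456": the greedy first group keeps 12 and the second group has given everything back. With the lazy first group, GroupDemo.Split(@"^(\d{0,2}?)(\d{0,4})(\d{4})$", "123456") returns "|12|3456".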

Tim Pietzcker
The second matching group needs the '?' so it expands as needed (lazy), otherwise 54444 won't match. Tested in RegexBuddy.
cnanney
Well... You are right on the money with your regex... still, what I'm looking for is a rational explanation of why my first expression didn't work as expected. It seems that I'm stumbling over the use of the "?" question mark versus the zero in the {0,4} quantifier...
Marcio Gabe
@cnanney: It seems I need to play with this RegexBuddy to learn some more :)
Marcio Gabe
@cnanney: The second group doesn't need a `?`. Why shouldn't it match `54444` (also tested in RegexBuddy, by the way)?
Tim Pietzcker
@Tim: I didn't put the "?" in the second group, and it works just like expected. I really like your commented C# format. The one change I did, for testing is not to anchor the search at the start. This way, if there are more digits, it will flow over to the left like this: "1(51) 3555-4444". Good stuff... :) Again, not looking into the actual validation of the phone numbers themselves... just trying to force the numbers into this kind of "( ) - " mask and learn some more about regex. Thanks!
Marcio Gabe
WOW!!!!!!!!!!!!!!! You are the man!! :D... Your edit a while ago, just explained to me the whole thing in a way that really makes sense. (This behavior of first group having content while second didn't and third had it as well is what was puzzling to me) Now the whole start matching, and backtracking (probably because of the anchor at the end) is making perfect sense to me! This explains it all!
Marcio Gabe
Using your regex that was edited 8 min ago, it would not match '3554444'. Without the ? in the second group, it will match as many times as possible (4) and leave only 3 digits for the third matching group, which you've said requires 4, thus the whole match will fail. BTW, how do you highlight code inside comments? Wrapping in code tags didn't work.
cnanney
http://i.imgur.com/jg1Wb.png
cnanney
Ha... OK so nevermind. You are right about not needing the '?'. I have to admit, I wasn't entirely sure why RegexBuddy was saying yours didn't work, so I tried to reason it the best I could (which was wrong). You have to believe RegexBuddy, right? It wouldn't lie to me! Actually it did, I had an older version that must have some bug in it. I just upgraded and you sir are indeed correct. +1 for you.
cnanney
Before you posted your comment about the second "?" mark, I was wondering, because I was testing it in the actual Visual Studio, and it worked without needing the second "?"... so... this explains it. Well... thank you all!... can I vote you up some more? :)
Marcio Gabe