I have a string containing an address and a phone number (US format: (xxx) xxx-xxxx). E.g.,

1243 K. Beverly Bld. # 223
Los Angeles, CA 41124
(213) 314-3221

This is a single string, and I need to extract the phone number from it using a regex. I could have used string tokens, but there is a chance that some invalid data is also concatenated with this string, so I think a regular expression would be the easiest and fastest way to find the phone number. After finding the phone number, I need to remove it from the input string.

Can someone please share a quick code snippet?

+1  A: 
Match matchResults = null;
try {
    Regex regexObj = new Regex(@"\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b");
    matchResults = regexObj.Match(subjectString);
    if (matchResults.Success) {
        // matched text: matchResults.Value
        // match start: matchResults.Index
        // match length: matchResults.Length
        // backreference n text: matchResults.Groups[n].Value
        // backreference n start: matchResults.Groups[n].Index
        // backreference n length: matchResults.Groups[n].Length
    } else {
        // Match attempt failed
    } 
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

I got this snippet from RegexBuddy, a very good helper for RegEx.
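
The snippet above only locates the match, while the question also asks for the number to be removed from the input. A minimal sketch of that extra step, reusing the same pattern, might look like the following. The class name `PhoneExtractor` and the method `TryExtract` are made up for illustration, and it assumes only the first phone number in the string matters.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the original answer).
public static class PhoneExtractor
{
    // Same pattern as in the snippet above.
    private static readonly Regex PhoneRegex =
        new Regex(@"\(?\b[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}\b");

    // Finds the first phone number in 'input'.
    // On success, 'phone' holds the matched text and 'remainder' holds the
    // input with that match cut out and trailing whitespace trimmed.
    public static bool TryExtract(string input, out string phone, out string remainder)
    {
        Match m = PhoneRegex.Match(input);
        if (!m.Success)
        {
            phone = string.Empty;
            remainder = input;
            return false;
        }

        phone = m.Value;
        // Match.Index and Match.Length describe where the match sits in the input.
        remainder = input.Remove(m.Index, m.Length).TrimEnd();
        return true;
    }
}
```

With the address from the question, `TryExtract` should yield `(213) 314-3221` as the phone number and leave the two address lines as the remainder.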

labilbe
Thanks mate! However, I am using the following regex: `((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}`
effkay
This is good, but won't work if an area code isn't provided, and also won't work if the input contains an extension number following the letter "x" (since `\b` won't see a word boundary). Accepting any digit in the area code and exchange is also too permissive: the allowed digits are slightly more restrictive in a few places (sure, that's picky, but as with using regex for IP addresses, the number ranges captured should ideally be restricted to only permissible values).
richardtallent
@richardtallent: thanks; but the main purpose was served by labilbe's solution. About the solution you have pasted, it is awesome as well!
effkay
+2  A: 

This will work for numbers in the US:

 ^                         # beginning of string, or BOL in multi-line mode
 (?:[+]?1[-. ]){0,1}       # optional calling code, not captured
 \(?                       # optional common prefix for area code, not captured
 ([2-9][0-8][0-9])?        # optional NANP-allowed area codes, captured in $1
 [-). ]*                   # optional common delimiters after area code, not captured
 (                         # begin capture group $2 for exchange code
  [2-9]                    # first digit cannot be a 1
  (?:[02-9][0-9]|1[02-9])  # second and third digits cannot be "11"
 )                         # end capture group for exchange
 [-. ]?                    # common delimiters between exchange and SN, not captured
 ([0-9]{4})                # subscriber number, captured in $3
 (?:                       # start non-capturing group for optional extension 
 \s*(?:x|ext\.?)\s*        # common prefixes before extension numbers
 (\d+)                     # extension digits, captured in $4
 ){0,1}                    # end non-capturing group
 $                         # end of string, or EOL in multi-line mode

This handles calling codes (optional), semi-validated area codes (optional) and exchange codes, extension numbers (optional), and captures each portion of the phone number in a separate variable for easy extraction and manipulation.

Using this expression in .NET, you would need to include the RegexOptions.IgnorePatternWhitespace and RegexOptions.Multiline flags so that the comments and whitespace in the pattern are ignored and the ^ and $ anchors find phone numbers on any line in the string.
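
As a rough sketch of how that could look in C#, the pattern is reproduced below inside a verbatim string, written with balanced parentheses, the hyphen placed first in the delimiter class, and the dot in `ext\.?` escaped so that it compiles and matches literally. The class name `NanpPhone` and the method `Find` are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper (names are illustrative, not from the original answer).
public static class NanpPhone
{
    // '#' comments and layout whitespace are ignored because of
    // RegexOptions.IgnorePatternWhitespace; spaces inside [...] stay literal.
    private const string Pattern = @"
        ^                                # start of a line (Multiline)
        (?:[+]?1[-. ])?                  # optional calling code, not captured
        \(?                              # optional opening parenthesis
        ([2-9][0-8][0-9])?               # group 1: optional area code
        [-). ]*                          # delimiters after the area code
        ([2-9](?:[02-9][0-9]|1[02-9]))   # group 2: exchange code
        [-. ]?                           # delimiter before subscriber number
        ([0-9]{4})                       # group 3: subscriber number
        (?:\s*(?:x|ext\.?)\s*(\d+))?     # group 4: optional extension
        $                                # end of a line (Multiline)
    ";

    private static readonly Regex PhoneRegex = new Regex(
        Pattern,
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);

    // Returns the first line-anchored phone number match in 'text'.
    public static Match Find(string text)
    {
        return PhoneRegex.Match(text);
    }
}
```

After `Match m = NanpPhone.Find(input);`, the parts are available as `m.Groups[1].Value` (area code) through `m.Groups[4].Value` (extension), and `m.Groups[4].Success` tells you whether an extension was present. Because of the `^` and `$` anchors, the number has to occupy a line of its own, as it does in the question's example.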

richardtallent