I'm trying to put together a comprehensive regex to validate phone numbers. Ideally it would handle international formats, but it must handle US formats, including the following:

  • 1-234-567-8901
  • 1-234-567-8901 x1234
  • 1-234-567-8901 ext1234
  • 1 (234) 567-8901
  • 1.234.567.8901
  • 1/234/567/8901
  • 12345678901

I'll answer with my current attempt, but I'm hoping somebody has something better and/or more elegant.

A: 

Here's my best try so far. It handles the formats above but I'm sure I'm missing some other possible formats.

^\d?(?:(?:[\+]?(?:[\d]{1,3}(?:[ ]+|[\-.])))?[(]?(?:[\d]{3})[\-/)]?(?:[ ]+)?)?(?:[a-zA-Z2-9][a-zA-Z0-9 \-.]{6,})(?:(?:[ ]+|[xX]|(?i:ext[\.]?)){1,2}(?:[\d]{1,5}))?$
Nicholas Trandem
+8  A: 

Have you had a look over at RegExLib?

Entering US phone number brought back quite a list of possibilities.

Rob Wells
i've never seen that one before. nice website.
Andrew Garrison
+2  A: 

Trying to build a comprehensive regex from scratch is usually a bad idea unless you have good, hard reasons for implementing it. Are you in direct contact with SMSCs or other telecom-operated hardware? If so, you should be able to get this sort of validation-related information from them.

Internet Friend
+3  A: 

What language are you using? If you're using Perl, for example, you can use the Regexp::Common library on CPAN.

Andy Lester
+21  A: 

It turns out that there's something of a spec for this, at least for North America, called the NANP.

You need to specify exactly what you want. What are legal delimiters? Spaces, dashes, and periods? No delimiter allowed? Can one mix delimiters (e.g., +0.111-222.3333)? How are extensions (e.g., 111-222-3333 x 44444) going to be handled? What about special numbers, like 911? Is the area code going to be optional or required?

Here's a regex for a 7 or 10 digit number, with extensions allowed, delimiters are spaces, dashes, or periods:

^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$
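
As a rough illustration only (Python is my choice here, and `parse_nanp` is a made-up helper name, not part of any library), the pattern above can be compiled as-is and its capture groups read back. Groups 1 and 2 are the area code with and without parentheses, so at most one of them is ever set.

```python
import re

# The pattern from the answer, copied verbatim and split only for readability.
NANP = re.compile(
    r"^(?:(?:\+?1\s*(?:[.-]\s*)?)?"
    r"(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)"
    r"|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?"
    r"([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?"
    r"([0-9]{4})"
    r"(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$"
)


def parse_nanp(text):
    """Return the parts of a matching number as a dict, or None if no match.

    Hypothetical helper for illustration; the key names are my own.
    """
    m = NANP.match(text)
    if m is None:
        return None
    return {
        "area": m.group(1) or m.group(2),  # None for 7-digit numbers
        "exchange": m.group(3),
        "line": m.group(4),
        "ext": m.group(5),                 # None when no extension is given
    }


if __name__ == "__main__":
    for sample in ("1-234-567-8901 x1234", "1 (234) 567-8901", "1/234/567/8901"):
        print(sample, "->", parse_nanp(sample))
```

In this sketch `1/234/567/8901` is rejected, since `/` is not among the delimiters the pattern allows.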
fatcat1111
here it is without the extension section (I make my users enter ext in a separate field): ^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})$
DJTripleThreat
This worked nicely for me. I needed to update the extension part, though, by adding a backslash before the #; otherwise everything from there on is treated as a comment.
Brian Surowiec
What about adding "(" and ")" to that list of delimiters?
Jeremy Ricketts
Here is a version that only matches 10 digit phone numbers (not 7 digit like 843-1212): `/(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})/`
Brian Armstrong
+45  A: 

Better option... just strip all non-digit characters on input (except 'x').

then, you end up with values like:

 12345678901
 12345678901x1234
 345678901x1234
 12344678901
 12345678901
 12345678901
 12345678901

Then when you display, reformat to your heart's content. e.g.

  1 (234) 567-8901
  1 (234) 567-8901 x1234
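
A minimal sketch of the strip-then-reformat idea, in Python for illustration. The function names `normalize` and `format_us` are hypothetical, and the display format is only applied to 11-digit numbers starting with 1; anything else is shown as bare digits.

```python
import re


def normalize(raw):
    """Drop everything except digits and the letter 'x' (extension marker)."""
    return re.sub(r"[^0-9x]", "", raw.lower())


def format_us(normalized):
    """Format a normalized 11-digit US number as '1 (234) 567-8901 x1234'."""
    number, _, ext = normalized.partition("x")
    if len(number) == 11 and number.startswith("1"):
        shown = f"{number[0]} ({number[1:4]}) {number[4:7]}-{number[7:]}"
    else:
        shown = number  # unknown shape: show the digits untouched
    return f"{shown} x{ext}" if ext else shown


if __name__ == "__main__":
    for sample in ("1-234-567-8901 ext1234", "1.234.567.8901"):
        print(format_us(normalize(sample)))
```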
scunliffe
Nice! I guess I was trying to design Complicator's Gloves (http://thedailywtf.com/Articles/The_Complicator_0x27_s_Gloves.aspx). This is much more elegant.
Nicholas Trandem
just glad I could offer up another useful option!
scunliffe
Also, I would strongly recommend that you store the number in the database as a String, not a number.
Kip
The formatting code is going to be a waste of time if the numbers are allowed to come from outside the US.
Daniel Earwicker
@Earwicker - agreed the formatting (if dealing with international #'s) should be smart enough to handle various formats... e.g. (0123) 456 7890 or +1 234 567-89-01. Depending how complex you want to get it should be something that can be figured out based on the number of digits and what the first few digits are.
scunliffe
Don't strip the +! What if the number is `+44 207 7845 7500`? You'd get a weird number here: `4 (420) 778-457500`
configurator
@configurator - agreed, I think I was trying to overly simplify the concept. +1 for keeping the "+"
scunliffe
This is good and all, but it doesn't validate what was entered was actually a phone number. For example, what if the user doesn't enter the requisite 10 digits? This should be combined with good regex validation.
Hugh Jeffner
Thank you kindly for the Complicator's Gloves article, Nick ;)
Lee Fogel
+2  A: 

I work for a market research company and we have to filter these types of input alllll the time. You're complicating it too much. Just strip the non-alphanumeric chars, and see if there's an extension.

For further analysis you can subscribe to one of many providers that will give you access to a database of valid numbers as well as tell you if they're landlines or mobiles, disconnected, etc. It costs money.

Joe Philllips
A: 

Is it possible to have the form display 4 separate fields (area code, 3-digit prefix, 4-digit part, extension) so that users can input each part of the number separately, and you can verify each piece individually? That way you not only make verification much easier, you can also store your phone numbers in a more consistent format in the database.

Kibbee
+4  A: 

You'll have a hard time dealing with international numbers with a single/simple regex; see this post on the difficulties of international (and even North American) phone numbers.

You'll want to parse the first few digits to determine what the country code is, then act differently based on the country.
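
To make that concrete, here is a small Python sketch of splitting off the country code by trying the longest prefix first. The table is a tiny illustrative sample, not a complete list, and `split_country_code` is a name I made up; a real implementation would need the full set of assigned codes.

```python
# A deliberately tiny sample of country calling codes (assumption: the
# input has already been reduced to digits with an optional leading '+').
COUNTRY_CODES = {
    "1": "NANP (US, Canada, ...)",
    "44": "United Kingdom",
    "49": "Germany",
    "358": "Finland",
}


def split_country_code(number):
    """Return (country_code, rest); country_code is None if unrecognized."""
    digits = number.lstrip("+")
    # Country codes are 1 to 3 digits long; try the longest candidate first.
    for length in (3, 2, 1):
        prefix = digits[:length]
        if prefix in COUNTRY_CODES:
            return prefix, digits[length:]
    return None, digits


if __name__ == "__main__":
    for sample in ("+441234567890", "+12345678901", "+9991"):
        print(sample, "->", split_country_code(sample))
```

Once the code is known, the remainder can be handed to a country-specific check, such as a NANP pattern for code 1.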

Beyond that - the list you gave does not include another common US format - leaving off the initial 1. Most cell phones in the US don't require it, and it'll start to baffle the younger generation unless they've dialed internationally.

You've correctly identified that it's a tricky problem...

Adam Davis
+1  A: 

I believe the Number::Phone::US and Regexp::Common (particularly the source of Regexp::Common::URI::RFC2806) Perl modules could help.

The question should probably be specified in a bit more detail to explain the purpose of validating the numbers. For instance, 911 is a valid number in the US, but 911x isn't for any value of x. That's so that the phone company can calculate when you are done dialing. There are several variations on this issue. But your regex doesn't check the area code portion, so that doesn't seem to be a concern.

Like validating email addresses, even if you have a valid result you can't know if it's assigned to someone until you try it.

If you are trying to validate user input, why not normalize the result and be done with it? If the user puts in a number you can't recognize as a valid number, either save it as entered or strip out undialable characters. The Number::Phone::Normalize Perl module could be a source of inspiration.

Jon Ericson
+2  A: 

There's a nice tutorial on this very problem in the excellent Dive Into Python. But I think scunliffe's answer is much simpler. Sometimes the best solution to a regex problem is to not use a regular expression!

davidavr
+7  A: 

Although the answer to strip all whitespace is neat, it doesn't really solve the problem that's posed, which is to find a regex. Take, for instance, my test script that downloads a web page and extracts all phone numbers using the regex. Since you'd need a regex anyway, you might as well have the regex do all the work. I came up with this:

1?\W*([2-9][0-8][0-9])\W*([2-9][0-9]{2})\W*([0-9]{4})(\se?x?t?(\d*))?

Here's a perl script to test it. When you match, $1 contains the area code, $2 and $3 contain the phone number, and $5 contains the extension. My test script downloads a file from the internet and prints all the phone numbers in it.

#!/usr/bin/perl
use strict;
use warnings;

my $us_phone_regex =
        '1?\W*([2-9][0-8][0-9])\W*([2-9][0-9]{2})\W*([0-9]{4})(\se?x?t?(\d*))?';


my @tests =
(
"1-234-567-8901",
"1-234-567-8901 x1234",
"1-234-567-8901 ext1234",
"1 (234) 567-8901",
"1.234.567.8901",
"1/234/567/8901",
"12345678901",
"not a phone number"
);

foreach my $num (@tests)
{
        if( $num =~ m/$us_phone_regex/ )
        {
                print "match [$1-$2-$3]\n" if not defined $4;
                print "match [$1-$2-$3 $5]\n" if defined $4;
        }
        else
        {
                print "no match [$num]\n";
        }
}

#
# Extract all phone numbers from an arbitrary file.
#
my $external_filename =
        'http://web.textfiles.com/ezines/PHREAKSANDGEEKS/PnG-spring05.txt';
my @external_file = `curl $external_filename`;
foreach my $line (@external_file)
{
        if( $line =~ m/$us_phone_regex/ )
        {
                print "match $1 $2 $3\n";
        }
}

Edit:

You can change \W* to \s*\W?\s* in the regex to tighten it up a bit. I wasn't thinking of the regex in terms of, say, validating user input on a form when I wrote it, but this change makes it possible to use the regex for that purpose.

'1?\s*\W?\s*([2-9][0-8][0-9])\s*\W?\s*([2-9][0-9]{2})\s*\W?\s*([0-9]{4})(\se?x?t?(\d*))?';
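
To show what the tightening buys you, here is a Python sketch (my choice of language, and the sample strings are made up) that runs both versions of the pattern over the same input. The loose `\W*` form tolerates any run of punctuation between digit groups, while the tightened form allows at most one non-word character, optionally surrounded by whitespace.

```python
import re

# Both patterns are copied from the answer; only the variable names are mine.
LOOSE = re.compile(
    r"1?\W*([2-9][0-8][0-9])\W*([2-9][0-9]{2})\W*([0-9]{4})(\se?x?t?(\d*))?"
)
TIGHT = re.compile(
    r"1?\s*\W?\s*([2-9][0-8][0-9])\s*\W?\s*([2-9][0-9]{2})"
    r"\s*\W?\s*([0-9]{4})(\se?x?t?(\d*))?"
)


def extract(pattern, text):
    """Return (area, exchange, line, extension) or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1, 2, 3, 5) if m else None


if __name__ == "__main__":
    for sample in ("1-234-567-8901 x1234", "234 --- 567 *** 8901"):
        print(repr(sample), extract(LOOSE, sample), extract(TIGHT, sample))
```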
indiv
A: 

My inclination is to agree that stripping non-digits and just accepting what's there is best. You could also check that at least a couple of digits are present, although that does prohibit something like an alphabetic phone number ("ASK-JAKE", for example).

A couple of simple Perl expressions might be:

@f = /(\d+)/g;
tr/0-9//dc;

Use the first one to keep the digit groups together, which may give formatting clues. Use the second one to trivially toss all non-digits.

Is it a worry that there may need to be a pause and then more keys entered? Or something like 555-1212 (wait for the beep) 123?

piCookie
+3  A: 

If you're talking about form validation, the regexp to validate correct meaning as well as correct data is going to be extremely complex because of varying country and provider standards. It will also be hard to keep up to date.

I interpret the question as looking for a broadly valid pattern, which may not be internally consistent: for example, accepting a valid set of numbers without checking that the trunk line, exchange, etc. conform to the valid pattern for the country code prefix.

North America is straightforward, and for international I prefer to use an 'idiomatic' pattern which covers the ways in which people specify and remember their numbers:

^((((\(\d{3}\))|(\d{3}-))\d{3}-\d{4})|(\+?\d{2}((-| )\d{1,8}){1,5}))(( x| ext)\d{1,5}){0,1}$

The North American pattern makes sure that if one parenthesis is included both are. The international accounts for an optional initial '+' and country code. After that, you're in the idiom. Valid matches would be:

 (xxx)xxx-xxxx
 (xxx)-xxx-xxxx
 (xxx)xxx-xxxx x123
 12 1234 123 1 x1111
 12 12 12 12 12 
 12 1 1234 123456 x12345
 +12 1234 1234
 +12 12 12 1234
 +12 1234 5678
 +12 12345678

This may be biased as my experience is limited to North America, Europe and a small bit of Asia.
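
For anyone who wants to try the idiomatic pattern, here is a Python sketch with the literal parentheses and the plus sign backslash-escaped, which is what a regex engine needs (an unescaped `+` right after `(` is a quantifier with nothing to repeat). The name `is_idiomatic` is hypothetical. As far as I can tell, the pattern written this way does not accept a hyphen directly after the closing parenthesis, so treat the sample checks as a description of this sketch only.

```python
import re

# North American alternative first, then the "idiomatic" international one,
# followed by an optional " x" / " ext" extension of 1 to 5 digits.
IDIOMATIC = re.compile(
    r"^((((\(\d{3}\))|(\d{3}-))\d{3}-\d{4})"
    r"|(\+?\d{2}((-| )\d{1,8}){1,5}))"
    r"(( x| ext)\d{1,5}){0,1}$"
)


def is_idiomatic(text):
    """True if the whole string fits the pattern."""
    return IDIOMATIC.match(text) is not None


if __name__ == "__main__":
    samples = (
        "(123)456-7890",
        "(123)456-7890 x123",
        "+12 1234 5678",
        "(123 456-7890",   # unbalanced parenthesis
        "(123)-456-7890",  # hyphen right after ')'
    )
    for sample in samples:
        print(repr(sample), is_idiomatic(sample))
```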

ron0
I've been trying to implement the above in my javascript validation script but I keep getting an `invalid quantifier` error. Any ideas on what I'm doing wrong?
Jannis
A: 

I'm starting to think my project isn't going to fly at all. We want to extract phone numbers from hundreds of thousands of Word documents, all multi-region documents from all over the world.

This is starting to give me a headache lol, and none of the regexes I have downloaded so far have been able to extract any of the numbers I have given them. Agh! lol :o(

How is this considered an answer?
Sailing Judo
A: 

Do a replace on formatting characters, then check the remaining for phone validity. In PHP,

 $replace = array( ' ', '-', '/', '(', ')', ',', '.' ); //etc; as needed
 preg_match( '/1?[0-9]{10}((ext|x)[0-9]{1,4})?/i', str_replace( $replace, '', $phone_num ) );

Breaking up a complex regexp like this can be just as effective, and much simpler.

rooskie
+7  A: 

".*"

If the user wants to give you his phone number, then trust him to get it right. If he does not want to give it to you then forcing him to enter a valid number will either send him to a competitor's site or make him enter a random string that fits your regex. I might even be tempted to look up the number of a premium rate sex line and enter that instead.

I would also consider any of the following as valid entries on a web site:

"123 456 7890 until 6pm, then 098 765 4321"

"123 456 7890 or try my mobile on 098 765 4321"

"ex-directory - mind your own business"

Dave Kirby
+3  A: 

Thank you everybody for the explanation. The Regex 1?\W*([2-9][0-8][0-9])\W*([2-9][0-9]{2})\W*([0-9]{4})(\se?x?t?(\d*))? helped me a lot to solve my problem. The answers helped me a lot in giving the Regex for Phone Number with area code.

Venkat
+1  A: 

I wrote the simplest one (although I didn't need the dot in it).

^([0-9\(\)\/\+ \-]*)$
Artjom Kurapov
Didn't work for me.
Brian Armstrong
+1  A: 

Note that stripping the ( and ) characters does not work for a common style of writing UK numbers: +44 (0) 1234 567890. That means dial either the international number +441234567890 or, within the UK, 01234567890.
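
One possible way to cope with that notation, sketched in Python under the assumption that a "(0)" inside a number starting with "+" is always an optional trunk prefix. The function name `normalize_plus_number` is made up for the example.

```python
import re


def normalize_plus_number(raw):
    """Reduce a number to digits, keeping a leading '+' and dropping '(0)'."""
    text = raw.strip()
    if text.startswith("+"):
        # Remove the optional trunk prefix before stripping other characters,
        # otherwise its 0 would end up inside the international number.
        text = re.sub(r"\(\s*0\s*\)", "", text)
        return "+" + re.sub(r"\D", "", text)
    return re.sub(r"\D", "", text)


if __name__ == "__main__":
    print(normalize_plus_number("+44 (0) 1234 567890"))
    print(normalize_plus_number("01234 567890"))
```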

Ben Clifford