ansaurus

Question

Answer 1

A:

A few things that will clean up your existing regex without really changing the functionality:

Replace {0,1} with ?, [(] with (, [)] with ). You also might as well just make your [2-9] b e a \d as well, so you can make those patterns be \d{3} and \d{4} for the last part. I doubt it will really increase the rate of false positives.

Kamil Kisiel 2008-12-22 05:57:05

Answer 2

A:

Why not re-use the work of others - for example, from RegExpLib.com?

My second suggestion is to remember there are other countries besides the USA, and quite a few of them have telephones ;-) Please don't forget us during your software development.

Also, there is a standard for the formatting of telephone numbers; the ITU's E.123. My recollection of the standard was that what it describes doesn't match well with popular usage.

Edit: I mixed up G.123 and E.123. Oops. Props Bortzmeyer

Oddthinking 2008-12-22 06:37:28

Answer 3

+1 A:

Your regexp only parses a specific format, which is not the international standard. If you limit yourself to one country, it may work.

Otherwise, the international standard is ITU E.123 : "Notation for national and international telephone numbers, e-mail addresses and Web addresses"

bortzmeyer 2008-12-22 08:55:16

Answer 4

+1 A:

First off, reliably capturing phone numbers with a single regular expression is notoriously difficult with a strong tendency to being impossible. Not every country has a definition of a "phone number" that is as narrow as it is in the U.S. Even in the U.S., things are more complicated than they seem (from the Wikipedia article on the North American Numbering Plan):

A) Country code: optional prefix ("1" or "+1" or "001")
- ((00|\+)?1)?
B) Numbering Plan Area code (NPA): cannot begin with 1, digit 2 cannot be 9
- [2-9][0-8][0-9]
C) Exchange code (NXX): cannot begin with 1, cannot end with "11", optional parentheses
- \(?[2-9](00|[2-9]{2})\)?
D) Station Code: four digits, cannot all be 0 (I suppose)
- (?!0{4})\d{4}
E) an optional extension may follow
- ([x#-]\d+)?
S) parts of the number are separated by spaces, dashes, dots (or not)
- [. -]?

So, the basic regex for the U.S. would be:

((00|\+)?1[. -]?)?\(?[2-9][0-8][0-9]\)?[. -]?[2-9](00|[2-9]{2})[. -]?(?!0{4})\d{4}([. -]?[x#-]\d+)?
| A       |S   |  |   B                | S   |   C             | S  |  D           | S  |  E      |

And that's just for the relatively trivial numbering plan of the U.S., and even there it certainly is not covering all subtleties. If you want to make it reliable you have to develop a similar beast for all expected input languages.

Tomalak 2008-12-22 09:41:01

Answer 5

+3 A:

Nice first take :) I think this version is a bit more readable (and probably a teensy bit faster). The key thing to note here is the use of re.sub. Keeps us away from the nasty match indexes...

import re

PHONE_RE = re.compile('([(]{0,1}[2-9]\d{2}[)]{0,1}[-_. ]{0,1}[2-9]\d{2}[-_.  ]{0,1}\d{4})')
NON_NUMERIC = re.compile('\D')

def numbers2links(s):

   def makelink(mo):
      raw_number = mo.group()
      number = NON_NUMERIC.sub("", raw_number)
      return '<a href="tel:%s">%s</a>' % (number, raw_number)

   return PHONE_RE.sub(makelink, s)


print numbers2links("Ghost Busters at (555) 423-2368! How about this one: 555 456 7890! 555-456-7893 is where its at.")

A note: In my practice, I've not noticed much of a speedup pre-compiling simple regular expressions like the two I'm using, even if you're using them thousands of times. The re module may have some sort of internal caching - didn't bother to read the source and check.

Also, I replaced your method of checking each character to see if it's in string.digits with another re.sub() because I think my version is more readable, not because I'm certain it performs better (although it might).

Triptych 2008-12-22 19:26:03

That's about 100x clearer - thanks Triptych!

jb 2008-12-22 21:14:55

Glad to help out :) I've spent many a night weeping over the details of the python re module. Glad to save someone else a little pain.

Triptych 2008-12-22 21:42:51

ansaurus

tags:

views:

answers:

Phone numbers to links in Python

related questions