views:

548

answers:

5

I'm working a piece of code to turn phone numbers into links for mobile phone - I've got it but it feels really dirty.

import re
from string import digits

PHONE_RE = re.compile('([(]{0,1}[2-9]\d{2}[)]{0,1}[-_. ]{0,1}[2-9]\d{2}[-_. ]{0,1}\d{4})')

def numbers2links(s):
    result = ""
    last_match_index = 0
    for match in PHONE_RE.finditer(s):
          raw_number = match.group()
          number = ''.join(d for d in raw_number if d in digits)
          call = '<a href="tel:%s">%s</a>' % (number, raw_number)
          result += s[last_match_index:match.start()] + call
          last_match_index = match.end()
    result += s[last_match_index:]
    return result

>>> numbers2links("Ghost Busters at (555) 423-2368! How about this one: 555 456 7890! 555-456-7893 is where its at.")
'Ghost Busters at <a href="tel:5554232368">(555) 423-2368</a>! How about this one: <a href="tel:5554567890">555 456 7890</a>! <a href="tel:5554567893">555-456-7893</a> is where its at.'

Is there anyway I could restructure the regex or the the regex method I'm using to make this cleaner?

Update

To clarify, my question is not about the correctness of my regex - I realize that it's limited. Instead I'm wondering if anyone had any comments on the method of substiting in links for the phone numbers - is there anyway I could use re.replace or something like that instead of the string hackery that I have?

A: 

A few things that will clean up your existing regex without really changing the functionality:

Replace {0,1} with ?, [(] with (, [)] with ). You also might as well just make your [2-9] b e a \d as well, so you can make those patterns be \d{3} and \d{4} for the last part. I doubt it will really increase the rate of false positives.

Kamil Kisiel
A: 

Why not re-use the work of others - for example, from RegExpLib.com?

My second suggestion is to remember there are other countries besides the USA, and quite a few of them have telephones ;-) Please don't forget us during your software development.

Also, there is a standard for the formatting of telephone numbers; the ITU's E.123. My recollection of the standard was that what it describes doesn't match well with popular usage.

Edit: I mixed up G.123 and E.123. Oops. Props Bortzmeyer

Oddthinking
+1  A: 

Your regexp only parses a specific format, which is not the international standard. If you limit yourself to one country, it may work.

Otherwise, the international standard is ITU E.123 : "Notation for national and international telephone numbers, e-mail addresses and Web addresses"

bortzmeyer
+1  A: 

First off, reliably capturing phone numbers with a single regular expression is notoriously difficult with a strong tendency to being impossible. Not every country has a definition of a "phone number" that is as narrow as it is in the U.S. Even in the U.S., things are more complicated than they seem (from the Wikipedia article on the North American Numbering Plan):

  • A) Country code: optional prefix ("1" or "+1" or "001")
    • ((00|\+)?1)?
  • B) Numbering Plan Area code (NPA): cannot begin with 1, digit 2 cannot be 9
    • [2-9][0-8][0-9]
  • C) Exchange code (NXX): cannot begin with 1, cannot end with "11", optional parentheses
    • \(?[2-9](00|[2-9]{2})\)?
  • D) Station Code: four digits, cannot all be 0 (I suppose)
    • (?!0{4})\d{4}
  • E) an optional extension may follow
    • ([x#-]\d+)?
  • S) parts of the number are separated by spaces, dashes, dots (or not)
    • [. -]?

So, the basic regex for the U.S. would be:

((00|\+)?1[. -]?)?\(?[2-9][0-8][0-9]\)?[. -]?[2-9](00|[2-9]{2})[. -]?(?!0{4})\d{4}([. -]?[x#-]\d+)?
| A       |S   |  |   B                | S   |   C             | S  |  D           | S  |  E      |

And that's just for the relatively trivial numbering plan of the U.S., and even there it certainly is not covering all subtleties. If you want to make it reliable you have to develop a similar beast for all expected input languages.

Tomalak
+3  A: 

Nice first take :) I think this version is a bit more readable (and probably a teensy bit faster). The key thing to note here is the use of re.sub. Keeps us away from the nasty match indexes...

import re

PHONE_RE = re.compile('([(]{0,1}[2-9]\d{2}[)]{0,1}[-_. ]{0,1}[2-9]\d{2}[-_.  ]{0,1}\d{4})')
NON_NUMERIC = re.compile('\D')

def numbers2links(s):

   def makelink(mo):
      raw_number = mo.group()
      number = NON_NUMERIC.sub("", raw_number)
      return '<a href="tel:%s">%s</a>' % (number, raw_number)

   return PHONE_RE.sub(makelink, s)


print numbers2links("Ghost Busters at (555) 423-2368! How about this one: 555 456 7890! 555-456-7893 is where its at.")

A note: In my practice, I've not noticed much of a speedup pre-compiling simple regular expressions like the two I'm using, even if you're using them thousands of times. The re module may have some sort of internal caching - didn't bother to read the source and check.

Also, I replaced your method of checking each character to see if it's in string.digits with another re.sub() because I think my version is more readable, not because I'm certain it performs better (although it might).

Triptych
That's about 100x clearer - thanks Triptych!
jb
Glad to help out :) I've spent many a night weeping over the details of the python re module. Glad to save someone else a little pain.
Triptych