ansaurus

Question

Python regex for fixing Australian/New Zealand Phone Numbers

Answer 1

+4 A:

Don't use complicated regexes. Delete EVERYTHING except digits -- non-digits are error-prone cruft. If the third digit is 0 (a trunk prefix left in after the country code), delete it. Then expect either 61 followed by a valid AUS area code ([23478] for generality; NB 4 is for mobiles) and then 8 digits, or 64 followed by a valid NZL area code (whatever that is) and then 7 digits. Anything else is bad. In the good stuff, insert the +()- at the appropriate places.
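The steps above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `normalise_phone` is made up for the example, and the New Zealand area-code set is an assumption (the answer leaves it unspecified), so check it against the NZ numbering plan before relying on it.

```python
import re

# Australian area codes as listed in the answer (4 is for mobiles).
AU_AREA_CODES = frozenset("23478")
# ASSUMPTION: New Zealand geographic area codes; not given in the answer.
NZ_AREA_CODES = frozenset("34679")


def normalise_phone(raw):
    """Return the number as '+CC(A)XXXX-XXXX' text, or None if it is bad."""
    digits = re.sub(r"\D", "", raw)            # delete everything except digits
    if len(digits) > 2 and digits[2] == "0":   # third digit is 0: delete it
        digits = digits[:2] + digits[3:]
    country, area, local = digits[:2], digits[2:3], digits[3:]
    if country == "61" and area in AU_AREA_CODES and len(local) == 8:
        pass                                   # valid Australian shape
    elif country == "64" and area in NZ_AREA_CODES and len(local) == 7:
        pass                                   # valid New Zealand shape
    else:
        return None                            # anything else is bad
    # Insert the +()- at the appropriate places; the last four digits
    # always form the second half of the local number.
    return "+%s(%s)%s-%s" % (country, area, local[:-4], local[-4:])
```

For example, `normalise_phone("+61 (02) 9876 5432")` gives `'+61(2)9876-5432'`, and anything that does not fit either shape gives `None`.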

By the way: (1) area code 2 is for the whole of NSW+ACT, not just Sydney, and 3 is for VIC+TAS; (2) lots of people these days don't have landlines, just mobiles, and people tend to keep the same mobile phone number longer than they keep the same landline number or the same postal address, so a mobile phone number is great for fuzzy matching of customer records -- so I'm more than a little curious why you don't include them.

The following tell you all you ever wanted to know, plus a whole lot more, about the Australian and New Zealand phone numbering schemes.

Comment on the regexes:

(1) You are using the search method with a "^" prefix. Using the match method with no prefix is somewhat less inelegant.

(2) You don't seem to be checking for trailing rubbish in your phone number field:

>>> import re
>>> standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})')
>>> m = standard_format.search("+61(3)1234-567890whoopsie")
>>> m.groups()
('61', '3', '1234', '5678')
>>>

You may like to (a) end some of your regexes with \Z (NOT $) so that they don't match OK when there is trailing rubbish or (b) introduce another group to catch trailing rubbish.
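A small sketch of option (a), reusing the pattern from the session above, may help show why `\Z` is preferred over `$`: without the `re.MULTILINE` flag, `$` still matches just before a trailing newline, whereas `\Z` matches only at the very end of the string. The names `BODY`, `loose`, `dollar` and `strict` are made up for this illustration.

```python
import re

BODY = (r'\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)'
        r'(?P<local_first_half>\d{3,4})-(?P<local_second_half>\d{4})')

loose = re.compile(BODY)            # no end anchor at all
dollar = re.compile(BODY + r'$')    # also matches before a trailing newline
strict = re.compile(BODY + r'\Z')   # matches only at the true end of string

# Trailing rubbish: accepted by the unanchored pattern, rejected by \Z.
print(loose.match("+61(3)1234-567890whoopsie") is not None)   # True
print(strict.match("+61(3)1234-567890whoopsie") is not None)  # False

# Trailing newline: accepted by $, rejected by \Z.
print(dollar.match("+61(3)1234-5678\n") is not None)          # True
print(strict.match("+61(3)1234-5678\n") is not None)          # False
```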

and a social engineering comment: Have you yet tested the user reaction to a staff member carrying out this directive: "Space instead of hyphen in local component - ask user to remediate"? Can't the script just fix it and carry on?

and some comments on the code:

the self.PHFull code

(a) is terribly repetitive (if you must have regexes put them in a list with corresponding action codes and error messages and iterate over the list)

(b) is the same for "error" cases as for standard cases (so why are you asking the users to "remediate"???)

(c) throws away the country code and substitutes a 0 i.e. your standard +61(2)1234-5678 is being kept as 0212345678 aarrgghhh ... even if you have the country stored with the address that's no good if an NZer migrates to Aus and the address gets updated but not the phone number and please don't say that you are relying on the current (no NZ customers outside the Auckland area???) non-overlap of area codes ...
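As a rough sketch of points (a) and (c), the regexes can live in one list alongside their action codes and messages, and the result can keep the country code rather than replacing it with a 0. The patterns, action codes and messages below are hypothetical placeholders, not the ones from the question's code.

```python
import re

# Each rule: (compiled pattern, action code, message). All three columns are
# illustrative assumptions; substitute the real cases you need to handle.
RULES = [
    (re.compile(r'\+(\d{2})\((\d)\)(\d{3,4})-(\d{4})\Z'),
     'ok', 'standard format'),
    (re.compile(r'\+(\d{2})\((\d)\)(\d{3,4}) (\d{4})\Z'),
     'fix', 'space instead of hyphen'),
    (re.compile(r'\+(\d{2}) ?\(0(\d)\) ?(\d{3,4})[- ]?(\d{4})\Z'),
     'fix', 'stray trunk 0 in area code'),
]


def classify(text):
    """Return (action, message, normalised number or None)."""
    for pattern, action, message in RULES:
        m = pattern.match(text)
        if m:
            country, area, first, second = m.groups()
            # Keep the country code in the stored form.
            fixed = "+%s(%s)%s-%s" % (country, area, first, second)
            return action, message, fixed
    return 'reject', 'unrecognised format', None
```

With this shape, adding a new tolerated variant is one more list entry rather than another copy of the matching code, and the "fix" cases are repaired by the script instead of being sent back to the user.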

Update after full story revealed

Keep it SIMPLE for both you and the staff. Instructions to staff using Active Directory should be (depending on which office) "Fill in +61(2)9876-7 followed by your 3-digit extension number". If they can't get that right after a couple of attempts, it's time they got the DCM.

So you use one regex per office, filling in the constant part, so that if, say, the SYD offices have numbers of the form +61(2)9876-7ddd, you use the regex r"\+61\(2\)9876-7\d{3,3}\Z". If a regex matches, then you remove all non-digits and use "0" + the_digits[2:] for the next app. If no regexes match, send a rocket.
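A minimal sketch of that per-office approach might look like this. Only the Sydney prefix comes from the example above; the `OFFICE_PATTERNS` table, the Melbourne entry and the function name `to_national` are made up for illustration.

```python
import re

# One regex per office, with the constant part filled in.
OFFICE_PATTERNS = {
    'SYD': re.compile(r"\+61\(2\)9876-7\d{3}\Z"),
    'MEL': re.compile(r"\+61\(3\)9123-4\d{3}\Z"),  # hypothetical office prefix
}


def to_national(number):
    """Return (office, national-format digits), or None if no regex matches."""
    for office, pattern in OFFICE_PATTERNS.items():
        if pattern.match(number):
            the_digits = re.sub(r"\D", "", number)  # remove all non-digits
            return office, "0" + the_digits[2:]     # swap country code for 0
    return None  # no regex matched: time to send a rocket
```

Here `to_national("+61(2)9876-7123")` gives `('SYD', '0298767123')`, while a number with the wrong constant part or trailing rubbish gives `None`.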

John Machin 2010-07-07 04:59:09

It's amusing that we posted pretty much the same comment at exactly the same time. Heh.

Alex Bliskovsky 2010-07-07 05:00:41

Be very careful. There are many Australian phone numbers that don't look valid but are, such as Radio Telephone numbers. I once had access to a database of the phonebook and, if I recall correctly, there are 6-, 7-, 8- and 9-digit telephone numbers.

Jerub 2010-07-07 05:13:26

@Jerub: AFAIK that applied only before the Great Renumbering -- in fact there were 5-digit phone numbers in the Sydney CBD up until some time in the 1990s.

John Machin 2010-07-07 05:29:52

The data I had was current as of 2003.

Jerub 2010-07-07 05:45:04

@Jerub: I'm finding that a bit hard to believe, unless it included pre-renumbering numbers linked to current numbers. Please read the link that I posted and tell us which "area code" prefixes you are referring to.

John Machin 2010-07-07 06:01:36

Answer 2

A:

Phone numbers are formatted that way to make them easier for people to remember; there's no reason that I can see for storing them like that. Why not split by commas and parse each number by simply ignoring anything that's not a digit?

>>> import string
>>> def parse_number(number):
    n = ''
    for x in number:
        if x in string.digits:
            n += x
    return n

Once you've got it like that you can do verification based on the international prefix and area code. (if the 3rd digit is 3 then there should be 7 more digits, etc)

After it's verified, splitting into components is easy. The first two digits are the prefix, the next is the area code, etc. You can do a check for all the common mistakes without using regex. Outputting is also pretty easy in this case.

Alex Bliskovsky 2010-07-07 04:59:45

"(if the 3rd digit is 3 then there should be 7 more digits, etc)": This is not a valid rule for the OP's AU numbers. In any case, the validity of the subsequent digits is determined by the international prefix, and perhaps by whether a mobile or landline is involved. Different rulesets are needed for different international prefixes (which in general range from 1 to 3 digits long).

John Machin 2010-07-07 05:27:06

Alternatively: `filter(str.isdigit, x)`

Ken 2010-07-07 05:58:16

Answer 3

+3 A:

+1 for @John Machin's recommendations.

The World Telephone Number Guide is quite useful for national numbering plans, especially the exceptions.

The ITU has freely available standards for lots of stuff too.

devstuff 2010-07-07 06:17:52
