ansaurus

Question

Answer 1

+7 A:

Good Lord Almighty, what a mess! :) If you have high-level semantic or business rules (such as the ones you describe talking about European numbers, numbers in Holland, etc.) you'd probably be better served breaking that single regexp test into several individual regexp tests, one for each of your high level rules.

if number =~ /...../  # Holland mobiles
  # ...
elsif number =~ /..../  # Belgian landlines
  # ...
# etc.
end

It'll be quite a bit easier to read and maintain and change that way.

Pistos 2008-11-06 14:13:20

And order your tests by most likely to match (assuming you know the demographics well enough).

tvanfosson 2008-11-06 14:15:14

@tvanfosson: Sure; agreed.

Pistos 2008-11-06 14:17:25

I didn't think of that :P Thanks :)

youri 2008-11-06 16:14:00

Answer 2

+3 A:

Split it into multiple expressions. For example, in Python:

import re

phone_no_patterns = [
    re.compile(r"[0-9]{13}"),                  # 0031201234567
    re.compile(r"\+(31|32)\(0\)\d{2}-\d{7}"),  # +31(0)20-1234567
    # ..etc..
]

def check_number(num):
    for pattern in phone_no_patterns:
        match = pattern.fullmatch(num)  # the whole string must match
        if match:
            return match.groups()       # empty tuple if the pattern has no groups
    return None

Then you just loop over each pattern, checking whether each one matches.

Splitting the patterns up makes it easy to fix specific numbers that are causing problems (which would be horrible with the single monolithic regex).
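As a rough sketch of why separate patterns help with troubleshooting, the Python snippet below attaches a label to each pattern so that a number can be traced to the rule that accepted it. The two patterns are the ones from the example above (with the leading `+` escaped); the labels and the `identify` function are hypothetical names added for illustration.

```python
import re

# (label, pattern) pairs; the labels are illustrative, not from the answer.
PHONE_PATTERNS = [
    ("13 digits", re.compile(r"[0-9]{13}")),                          # 0031201234567
    ("+CC(0)xx-xxxxxxx", re.compile(r"\+(31|32)\(0\)\d{2}-\d{7}")),   # +31(0)20-1234567
]

def identify(num):
    """Return the label of the first pattern matching the whole string, else None."""
    for label, pattern in PHONE_PATTERNS:
        if pattern.fullmatch(num):
            return label
    return None

for sample in ["0031201234567", "+31(0)20-1234567", "+31(0)20 1234567"]:
    print(sample, "->", identify(sample))
```

A number that is rejected (the last sample here, which uses a space instead of a dash) points directly at the one pattern that needs adjusting, rather than at a single large expression.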

dbr 2008-11-06 14:25:03

Answer 3

+1 A:

It's not an optimization, but you use

(-)?( )?

three times in your regex. This will cause you to match on phone numbers like these

+31(0)6-12345678
+31(0)6 12345678

but will also match numbers containing a dash followed by a space, like

+31(0)6- 12345678

You can replace

(-)?( )?

with

(-| )?

to match either a dash or a space.
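The difference can be checked with a small Python sketch. The surrounding pattern (`+31(0)6`, a separator, then 8 digits) is a simplified stand-in assumed for this test, not the full regex from the question, and `number_re` is a hypothetical helper.

```python
import re

def number_re(separator):
    # Hypothetical test pattern: "+31(0)6", a separator, then 8 digits.
    return re.compile(r"\+31\(0\)6" + separator + r"\d{8}")

loose = number_re(r"(-)?( )?")     # dash and space are independent options
strict = number_re(r"(-| )?")      # at most one of dash or space
char_class = number_re(r"[- ]?")   # same as strict, without a capture group

samples = ["+31(0)6-12345678", "+31(0)6 12345678", "+31(0)6- 12345678"]
for s in samples:
    print(s, bool(loose.fullmatch(s)), bool(strict.fullmatch(s)),
          bool(char_class.fullmatch(s)))
```

Only the `loose` form accepts the third sample, where a dash is followed by a space.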

Bill the Lizard 2008-11-06 14:25:57

better yet `[- ]?`

Brad Gilbert 2008-11-06 15:18:08

That is better. Your solution saves a character. I was saving myself typing. :)

Bill the Lizard 2008-11-06 15:54:13

I didn't notice I did that. Thanks.

youri 2008-11-06 16:16:18

Answer 4

+2 A:

(31|32) looks bad. When matching 32, the regex engine will first try to match 31 (2 chars), fail, and backtrack two characters to match 32. It's more efficient to first match 3 (one character), try 1 (fail), backtrack one character and match 2.
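A quick check, sketched with Python's `re` module, that factoring out the common `3` does not change what is accepted: `3[12]` matches exactly the same two-digit strings as `(31|32)`. Whether the factored form is measurably faster depends on the regex engine, so no timing is claimed here.

```python
import re

alternation = re.compile(r"(31|32)")
factored = re.compile(r"3[12]")

# Compare the two patterns on every two-digit string "00".."99".
codes = [f"{n:02d}" for n in range(100)]
accepted_alt = [c for c in codes if alternation.fullmatch(c)]
accepted_fac = [c for c in codes if factored.fullmatch(c)]

print(accepted_alt)
print(accepted_fac)
```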

Of course, your regex fails on 0800- numbers; they're not 10 digits.

MSalters 2008-11-06 15:33:12

I don't want 0800 numbers, but the other part of your comment was useful. Thanks.

youri 2008-11-06 16:13:02

Answer 5

+4 A:

First observation: reading the regex is a nightmare. It cries out for Perl's /x mode.

Second observation: there are lots, and lots, and lots of capturing parentheses in the expression (42 if I count correctly; and 42 is, of course, "The Answer to Life, the Universe, and Everything" -- see Douglas Adams's "The Hitchhiker's Guide to the Galaxy" if you need that explained).

Bill the Lizard notes that you use '(-)?( )?' several times. There's no obvious advantage to that compared with '-? ?' or possibly '[- ]?', unless you are really intent on capturing the actual punctuation separately (but there are so many capturing parentheses that working out which '$n' items to use would be hard).

So, let's try editing a copy of your one-liner:

( |^|>)
(
    ((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{2})(-)?( )?)?)([0-9]{7})) |
    ((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{3})(-)?( )?)?)([0-9]{6})) |
    ((((((\+|00)(31|32)( )?(\(0\))?)|0)([0-9]{1})(-)?( )?)?)([0-9]{8}))
)
( |$|<)

OK - now we can see the regular structure of your regular expression.

There's much more analysis possible from here. Yes, there can be vast improvements to the regular expression. The first, obvious, one is to extract the international prefix part, and apply that once (optionally, or require the leading zero) and then apply the national rules.

( |^|>)
(
    (((\+|00)(31|32)( )?(\(0\))?)|0)
    (
        (((([0-9]{2})(-)?( )?)?)([0-9]{7})) |
        (((([0-9]{3})(-)?( )?)?)([0-9]{6})) |
        (((([0-9]{1})(-)?( )?)?)([0-9]{8}))
    )
)
( |$|<)

Then we can simplify the punctuation as noted before, and remove some plausibly redundant parentheses, and improve the country code recognizer:

( |^|>)
(
    (((\+|00)3[12] ?(\(0\))?)|0)
    (
        (((([0-9]{2})-? ?)?)[0-9]{7}) |
        (((([0-9]{3})-? ?)?)[0-9]{6}) |
        (((([0-9]{1})-? ?)?)[0-9]{8})
    )
)
( |$|<)

We can observe that the regex does not enforce the rules on mobile phone codes (so it does not insist that '06' is followed by 8 digits, for example). It also seems to allow the 1, 2 or 3 digit 'exchange' code to be optional, even with an international prefix - probably not what you had in mind, and fixing that removes some more parentheses. We can remove still more parentheses after that, leading to:

( |^|>)
(
    (((\+|00)3[12] ?(\(0\))?)|0)        # International prefix or leading zero
    (
        ([0-9]{2}-? ?[0-9]{7}) |        # xx-xxxxxxx
        ([0-9]{3}-? ?[0-9]{6}) |        # xxx-xxxxxx
        ([0-9]{1}-? ?[0-9]{8})          # x-xxxxxxxx
    )
)
( |$|<)
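For anyone trying this layout outside Perl, here is one possible adaptation using Python's `re.VERBOSE` flag, offered as a sketch rather than a drop-in replacement. Three things are assumptions of this adaptation: literal spaces are written as `[ ]` because verbose mode ignores bare whitespace in the pattern, the helper groups are non-capturing so that only the number itself is captured, and the three national formats are wrapped in one group so that the prefix applies to each of them. The name `find_number` is hypothetical.

```python
import re

PHONE_RE = re.compile(r"""
    (?:[ ]|^|>)                             # leading delimiter
    (
        (?:(?:\+|00)3[12][ ]?(?:\(0\))?|0)  # international prefix or leading zero
        (?:
            [0-9]{2}-?[ ]?[0-9]{7} |        # xx-xxxxxxx
            [0-9]{3}-?[ ]?[0-9]{6} |        # xxx-xxxxxx
            [0-9]{1}-?[ ]?[0-9]{8}          # x-xxxxxxxx
        )
    )
    (?:[ ]|$|<)                             # trailing delimiter
""", re.VERBOSE)

def find_number(text):
    """Return the first phone number found in text, or None."""
    m = PHONE_RE.search(text)
    return m.group(1) if m else None

for sample in ["0201234567", "call +31(0)20-1234567 now", "0800-1234"]:
    print(repr(sample), "->", find_number(sample))
```

Like the expression it is adapted from, this sketch still accepts any digits after the prefix; it does not enforce rules such as '06' being followed by 8 digits.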

And you can work out further optimizations from here, I'd hope.

Jonathan Leffler 2008-11-06 15:36:54

Thank you. I did break it up for myself to see if I could achieve this, but I must have done something wrong... Thanks, this is really helpful.

youri 2008-11-06 16:15:03

Can I optimize this phone-regex?