Possible Duplicates:
A comprehensive regex for phone number validation
grep with regex for phone number

Hello Everyone,

I am new to Stack Overflow and I have a quick question. Let's assume we are given a large number of HTML files (large as in theoretically infinite). How can I use regular expressions to extract the list of phone numbers from all those files?

An explanation of the expression would be much appreciated. The phone numbers can be in any of the following formats:

  • (123) 456 7899
  • (123).456.7899
  • (123)-456-7899
  • 123-456-7899
  • 123 456 7899
  • 1234567899

Thanks a lot for all your help and have a good one!

+1  A: 

/^[-.)( ]*([0-9]{3})[-.)( ]*([0-9]{3})[-.)( ]*([0-9]{4})$/

Should accomplish what you are trying to do.

The leading ^ means "start of the line"; together with the trailing $ ("end of the line") it forces the pattern to account for the whole string.

Each [-.)( ]* means "any period, hyphen, parenthesis, or space, appearing 0 or more times". The hyphen is listed first so that it is taken literally instead of forming a range.

The ([0-9]{3}) groups each match 3 digits (the last group is set to match 4).
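As a rough illustration of the breakdown above, here is the same expression tried in Python against the formats from the question. Python is my choice here, not the answerer's; `split_number` is a hypothetical helper name, and the separator class is written as `[-.)( ]` so that the hyphen is literal.

```python
import re

# Assumption: separator class spelled with the hyphen first so it is not read as a range.
PHONE = re.compile(r"^[-.)( ]*([0-9]{3})[-.)( ]*([0-9]{3})[-.)( ]*([0-9]{4})$")

samples = [
    "(123) 456 7899",
    "(123).456.7899",
    "(123)-456-7899",
    "123-456-7899",
    "123 456 7899",
    "1234567899",
]

def split_number(text):
    """Return (area, prefix, line) if the whole string matches, else None."""
    m = PHONE.match(text)
    return m.groups() if m else None

for s in samples:
    print(s, "->", split_number(s))
```

Because each separator run may be any mix of those characters, odd strings such as `)123).)456).9999` pass as well, while ordinary sentences that merely contain numbers are rejected by the `^` and `$` anchors.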

Hope that helps!

webdestroya
`.*` is pretty heavy handed. You'll probably be picking up a lot of false positives.
Stephen
Not unless someone throws in extra numbers.
webdestroya
@web What about the sentence "I wanted 100 boxes, but the guy gave me 200 instead, and I had to pay $1000 for them!" Plus since this is HTML you could get stuff like `<div style="width:100px; height:200px; color:#ff0000">`
Michael Mrozek
@Michael Mrozek - Ok, I fixed it to be more specific
webdestroya
@webdestroya: Thanks for the reply, but don't you think this implementation will lead to the following WRONG syntax to be passed as well: )123).)456).9999 ?
Rocky
@Rocky - I guess you will have to adapt it to the conditions you are expecting, but yes, it would match that.
webdestroya
@webdestroya: Perfect! Thanks.
Rocky
+1  A: 

This will help you catch the ones with an area code in parentheses

([0-9]\{3\})[ .-][0-9]\{3\}[ .-][0-9]\{4\}

The others are:

[0-9]\{3\}[ -][0-9]\{3\}[ -][0-9]\{4\}
[0-9]\{10\}

I separated the first and second patterns because combining them without backtracking could lead to accepting (123 456 7890 or 123) 456 7890.

Note also that on my terminal using grep, I had to escape the { } for the repetition. You may not have to, or you may have to escape other characters depending on where you intend to use this.
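To show how the escaping changes outside grep, here is a sketch of the same three patterns in Python's `re` syntax, where the braces need no backslash but literal parentheses do. The constant names and `is_phone` are made up for this example, and joining the patterns with `|` is my own shortcut.

```python
import re

# The three grep patterns above, rewritten for Python's re module.
WITH_PARENS = r"\([0-9]{3}\)[ .-][0-9]{3}[ .-][0-9]{4}"
WITH_SEPARATORS = r"[0-9]{3}[ -][0-9]{3}[ -][0-9]{4}"
DIGITS_ONLY = r"[0-9]{10}"

# Alternation keeps the parenthesized and bare forms as separate branches.
ANY_FORMAT = re.compile("|".join([WITH_PARENS, WITH_SEPARATORS, DIGITS_ONLY]))

def is_phone(text):
    """True if the whole string is in one of the three accepted shapes."""
    return ANY_FORMAT.fullmatch(text) is not None

print(is_phone("(123) 456 7899"))
print(is_phone("(123 456 7890"))
```

Since the branches stay separate, a string with only one of the two parentheses does not pass as a whole.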

Phil
+1  A: 

^(\(?\d{3}\)?)([ .-])(\d{3})([ .-])(\d{4})$

This should match every format except the last one. For that one you could use a separate pattern: ^\d{10}$

There is one flaw: it will also match (123 456 7899

  1. ^(\(?\d{3}\)?): the leading ^ matches the beginning of the text. \(? and \)? each make a parenthesis optional, and that is the problem: the closing parenthesis should be required only when there was an opening one, and I don't know whether that check is possible using regex alone. \d{3} matches three digits.

  2. ([ .-]) will match any one of those characters, exactly once.

  3. (\d{3}) will match three digits

  4. Same as 2

  5. (\d{4})$ four digits followed by the end of the text ($)

Since you want to extract from an HTML page, you would have to drop ^ and $ so the pattern can match anywhere in the text, and set the global flag; in JavaScript, /exp/g
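A minimal sketch of that extraction step, in Python instead of JavaScript: `re.findall` plays the role of the global flag. The capturing groups are dropped so that whole matches are returned, `find_phones` is a hypothetical name, and the `(?<!\d)` / `(?!\d)` guards are my own addition, not part of the pattern above.

```python
import re

# Pattern from above without ^ and $.
# Assumption: the lookarounds keep a match from starting or ending inside a longer digit run.
PHONE = re.compile(r"(?<!\d)\(?\d{3}\)?[ .-]\d{3}[ .-]\d{4}(?!\d)")

def find_phones(html):
    """Return every substring of html that matches the pattern."""
    return PHONE.findall(html)

html = '<p>Call (123) 456 7899 or 123-456-7899.</p><div style="width:100px; height:200px">'
print(find_phones(html))
```

The CSS values in the sample markup are skipped because each three-digit run there is followed by a letter, not by one of the separators.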

You can test Regex here

BrunoLM
+1  A: 

Without knowing what language you're using I am unsure whether or not the syntax is correct.

This should match all of your groups with very few false positives:

/\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})/

The groups you will be interested in after the match are groups 1, 3, and 4. Group 2 exists only to make sure the first and second separator characters (a space, ., or -) are the same.

For example, a sed command to strip the separators and leave phone numbers in the form 1234567899:

sed "s/(\{0,1\}\([0-9]\{3\}\))\{0,1\}\([ .-]\{0,1\}\)\([0-9]\{3\}\)\2\([0-9]\{4\}\)/\1\3\4/"
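For comparison, roughly the same substitution can be sketched in Python, using the expression in its unescaped form. `strip_separators` is a hypothetical name; note that `re.sub` replaces every occurrence, whereas the sed command as written replaces only the first one on each line.

```python
import re

# Group 2 captures the first separator; \2 requires the second one to be identical.
PHONE = re.compile(r"\(?([0-9]{3})\)?([ .-]?)([0-9]{3})\2([0-9]{4})")

def strip_separators(text):
    """Rewrite each matched phone number as its ten digits."""
    return PHONE.sub(r"\1\3\4", text)

print(strip_separators("Call 123.456.7899 now"))
print(strip_separators("123-456 7899"))
```

The second call leaves its input untouched, because the separators differ and the backreference `\2` fails.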

Here are the false positives of my expression:

  • (123)4567899
  • (1234567899
  • (123 456 7899
  • (123.456.7899
  • (123-456-7899
  • 123)4567899
  • 123) 456 7899
  • 123).456.7899
  • 123)-456-7899

Breaking up the expression into two parts, one that matches with parentheses and one that matches without, will eliminate all of these false positives except for the first one:

/\(([0-9]{3})\)([ .-]?)([0-9]{3})\2([0-9]{4})|([0-9]{3})([ .-]?)([0-9]{3})\5([0-9]{4})/

Groups 1, 3, and 4 or 5, 7, and 8 would matter in this case.

Trey
Thanks everyone for the responses, and sorry for any inconvenience that repeating an already existing question may have caused you.
Rocky