The following Python script lets me scrape email addresses from a given file using regular expressions.

I'm trying to extend it to match phone numbers as well. I wrote this regex, and it seems to work on 7- and 10-digit numbers:

(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})

Can this simply be added to my existing regular expression? I assume I need to edit the call to re.compile, but I'm not sure how to do this in Python. Any help would be appreciated.

import os
import re

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'

# read the file
if os.path.exists(filename):
    with open(filename, 'r') as data:
        bulkemails = data.read()
else:
    print("File not found.")
    raise SystemExit

# regex = [email protected]
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
    emails += str(x) + "\n"

# function to write file
def writefile():
    with open(newfilename, 'w') as f:
        f.write(emails)
    print("File written.")

EDIT: When run on http://en.wikipedia.org/wiki/Telephone_number, it produces the following output:

2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
2678400
8790468
9664261
555-1212
555-9225
555-1212
869-1234
555-5555
555-1212
867-5309
867-5309
867-5309
(267) 867-5309
(212) 736-5000
243-3460
2977743
1000000
2048000
2048000
8790468
9070412
9664261
9664261
9664261
+1  A: 

I would not advise combining the two regexes. It's possible, but it will make for code which is harder to understand and maintain down the road.

(Also, leaving the regexes separate will let you handle emails and phone numbers differently down the line, which you're likely to want to do.)
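
A minimal sketch of what keeping them separate could look like. The two patterns are the ones suggested elsewhere in this thread; the names EMAIL_RE, PHONE_RE and extract_contacts are hypothetical, chosen only for illustration:

```python
import re

# Two independent patterns, one per kind of data (names are hypothetical).
EMAIL_RE = re.compile(r'\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b',
                      re.IGNORECASE)
PHONE_RE = re.compile(r'(?:\(?\b\d{3}\)?[-.\s]*)?\d{3}[-.\s]*\d{4}\b')


def extract_contacts(text):
    """Return (emails, phones), each a list of matched strings."""
    return EMAIL_RE.findall(text), PHONE_RE.findall(text)


emails, phones = extract_contacts("Mail bob@example.com or call 555-867-5309.")
print(emails)  # ['bob@example.com']
print(phones)  # ['555-867-5309']
```

Because each list comes back on its own, the emails and the phone numbers can be written to different files or cleaned up differently later.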

pjmorse
A: 

For one, I would simplify your regex:

(?:\(?\b\d{3}\)?[-.\s]*)?\d{3}[-.\s]*\d{4}\b

will match the same correct numbers as before and have fewer false hits.
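
As a quick check, here is a sketch that runs this pattern over a made-up sentence (the sample text and variable names are assumptions, not taken from the question's input file):

```python
import re

# The simplified phone pattern from above; every group is non-capturing,
# so findall() returns the full matched strings.
phone_re = re.compile(r'(?:\(?\b\d{3}\)?[-.\s]*)?\d{3}[-.\s]*\d{4}\b')

sample = "Call (267) 867-5309, 555-1212 or 212.736.5000 today."
numbers = phone_re.findall(sample)
print(numbers)  # ['(267) 867-5309', '555-1212', '212.736.5000']
```

Note that a bare 7-digit run such as 2678400 still matches, so this pattern alone cannot tell a phone number from any other 7-digit number.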

Second, your e-mail regex will miss a lot of valid e-mail addresses and have many false positives, too (it would match aaaa@@@@aaaa, for example). While you can never match e-mail addresses with 100% reliability using a regex, the following one is better:

\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b

(Use the case-insensitive option when compiling it.)
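
A short sketch of compiling it that way with the standard re module (re.IGNORECASE is the long form of re.I; the sample string is made up for illustration):

```python
import re

pattern = r'\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\b'

# The character classes only list A-Z, so the flag is what lets
# lower-case and mixed-case addresses match.
email_re = re.compile(pattern, re.IGNORECASE)

sample = "Contact Jane.Doe@Example.org or aaaa@@@@aaaa for details."
found = email_re.findall(sample)
print(found)  # ['Jane.Doe@Example.org']

# Without the flag the same pattern finds nothing in this sample.
found_case_sensitive = re.compile(pattern).findall(sample)
print(found_case_sensitive)  # []
```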

To restrict matches to a few TLDs, you can use

\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+(?:asia|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[A-Z]{2})\b
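
If the TLD list lives in Python rather than being typed into the pattern by hand, the alternation can be built from it. This is only a sketch: the list below is a deliberately short, hypothetical subset, and the names TLDS and tld_alternation are made up for the example.

```python
import re

# Hypothetical, incomplete list of generic TLDs to accept.
TLDS = ['com', 'org', 'net', 'gov', 'info']

# Join the entries into "com|org|net|..." and keep the two-letter
# country-code fallback from the pattern above.
tld_alternation = '|'.join(re.escape(tld) for tld in TLDS)
pattern = (r'\b[A-Z0-9._%+-]+@(?:[A-Z0-9-]+\.)+(?:'
           + tld_alternation + r'|[A-Z]{2})\b')
email_re = re.compile(pattern, re.IGNORECASE)

sample = "ok@example.com, cc@example.de, bad@example.invalid"
found = email_re.findall(sample)
print(found)  # ['ok@example.com', 'cc@example.de']
```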
Tim Pietzcker
Thanks for the modified regex. How do you specify the case-insensitive option when compiling?
Aaron
Also, do you happen to know of a simple way to restrict the email address to specific TLDs?
Aaron
`re.compile("regex", re.I)`, and why would you want to limit your regex to TLDs?
Tim Pietzcker
Cool. I was just thinking it would help verify the emails even more.
Aaron
Not a good idea. You'll have to send an email to a potential address anyway to verify - no regex and no parser can find out if an address actually exists.
Tim Pietzcker
OK, that makes sense. Is there a way to put both of these together? Or is it best to do each one separately in Python?
Aaron
I just went ahead and did each separately, and it seems to be working well. I also found a list of TLDs, and it looked like there were 20 or so. Is there no way to add these into the regex manually? I know you said it wasn't a good idea, but I was just wondering if it's possible. Thanks again for all your help with this.
Aaron
So I tried running the first phone-number regex on an HTML page, and it is giving me interesting results. Can you check out the edit that I posted above? I'm not sure where the first few items are even coming from.
Aaron
Tim Pietzcker
That makes sense. I completely ignored the fact that it could be coming from the HTML. Thanks for getting back to me so quickly.
Aaron
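
One way to keep digits in the markup from being matched is to strip the tags before running the regex. Below is a rough sketch using the standard-library html.parser module (Python 3). The class TextOnly, the helper visible_text and the sample markup are all hypothetical, and the sample is not taken from the Wikipedia page.

```python
import re
from html.parser import HTMLParser

PHONE_RE = re.compile(r'(?:\(?\b\d{3}\)?[-.\s]*)?\d{3}[-.\s]*\d{4}\b')


class TextOnly(HTMLParser):
    """Collect text nodes and ignore tags and their attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(html):
    """Return the text content of an HTML string, one node per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.chunks)


# Made-up markup: the 7-digit run sits inside an attribute value.
page = '<a href="/w/index.php?oldid=2678400">Call 867-5309</a>'
raw_hits = PHONE_RE.findall(page)
text_hits = PHONE_RE.findall(visible_text(page))
print(raw_hits)   # ['2678400', '867-5309']
print(text_hits)  # ['867-5309']
```

This does not skip the contents of script or style elements, and numbers that appear in the visible text will still match, so it only reduces false hits rather than eliminating them.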