Hello, the following Python script lets me scrape email addresses from a given file using regular expressions.

How could I extend it so that it also picks up phone numbers? Say, either 7-digit numbers or 10-digit numbers (with area code), and also accounting for parentheses around the area code?

My current script can be found below:

import os
import re

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'

# read the file
if os.path.exists(filename):
    with open(filename, 'r') as data:
        bulkemails = data.read()
else:
    print("File not found.")
    raise SystemExit

# regex = [email protected]
# a single '@', and an escaped dot before the final part of the domain
r = re.compile(r'(\b[\w.]+@[\w.]+\.\w+\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
    emails += str(x) + "\n"

# function to write file
def writefile():
    with open(newfilename, 'w') as f:
        f.write(emails)
    print("File written.")

Regex for phone numbers:

(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})

Another regex for phone numbers:

(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?
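
As a rough illustration of how the first pattern above behaves from Python, here is a minimal sketch. The `sample` string and the `find_phones` helper are made up for this example and are not part of the original script. It uses `finditer` with `group(0)` rather than `findall`, because `findall` returns tuples of groups once a pattern has more than one capturing group (as the second, longer pattern does).

```python
import re

# First phone pattern from the question, split across lines for readability.
PHONE_SIMPLE = re.compile(
    r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}'      # 000-000-0000
    r'|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}'      # (000) 000-0000
    r'|\d{3}[-\.\s]\d{4})'                 # 000-0000
)

# Hypothetical sample text, invented for this illustration.
sample = "Call 555-123-4567 or (555) 987-6543; front desk is 555-0199."

def find_phones(text):
    """Return every substring of text that the pattern matches."""
    # group(0) is the whole match, however many capturing groups exist.
    return [m.group(0) for m in PHONE_SIMPLE.finditer(text)]

print(find_phones(sample))
# ['555-123-4567', '(555) 987-6543', '555-0199']
```

Note that this pattern requires a separator, so a bare run of digits such as 5551234 is not matched.
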
+2  A: 

Dive Into Python has a specific example of what you're looking for here:

http://diveintopython.org/regular_expressions/phone_numbers.html

Dan McDougall
A: 

If you are interested in learning Regex, you could take a stab at writing it yourself. It's not quite as hard as it's made out to be. Sites like RegexPal let you enter some test data, then write and test a regular expression against that data. Using RegexPal, try adding some phone numbers in the various formats you expect to find them (with brackets, area codes, etc.), grab a Regex cheat sheet, and see how far you can get. If nothing else, it will help in reading other people's expressions.

Edit: Here is a modified version of your Regex, which should also match 7- and 10-digit phone numbers that lack any hyphens, spaces or dots. I added question marks after the character classes (the []s), which makes the separator they match optional. I tested it in RegexPal, but as I'm still learning Regex, I'm not sure that it's perfect. Give it a try.

(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})

It matched the following values in RegexPal:

000-000-0000
000 000 0000
000.000.0000

(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000

000-0000
000 0000
000.0000

0000000
0000000000
(000)0000000
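
The same check can be repeated outside RegexPal. The following is only a sketch, assuming Python 3.4 or later (for `re.fullmatch`); the names `PHONE_LOOSE`, `candidates` and `unmatched` are invented for the example. It runs the modified expression against the values listed above and collects any that fail to match in full.

```python
import re

# Modified pattern from this answer: each separator is now optional.
PHONE_LOOSE = re.compile(
    r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}'
    r'|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}'
    r'|\d{3}[-\.\s]??\d{4})'
)

# The values listed above as matching in RegexPal.
candidates = [
    "000-000-0000", "000 000 0000", "000.000.0000",
    "(000)000-0000", "(000)000 0000", "(000)000.0000",
    "(000) 000-0000", "(000) 000 0000", "(000) 000.0000",
    "000-0000", "000 0000", "000.0000",
    "0000000", "0000000000", "(000)0000000",
]

# fullmatch succeeds only if the entire string is one phone number.
unmatched = [c for c in candidates if PHONE_LOOSE.fullmatch(c) is None]
print(unmatched)  # [] means every listed value matched

# Caveat: with optional separators, any long run of digits also matches.
print(PHONE_LOOSE.search("1234567890123").group(0))  # 1234567890
```

If matches inside longer digit runs (order numbers, IDs) turn out to be a problem, one possible refinement is to wrap the pattern in `(?<!\d)` and `(?!\d)` so that a match cannot be preceded or followed by another digit.
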
Auguste
Thanks, I have found RegexPal rather helpful. I edited my post to include what I have so far for phone numbers. One thing I'm having difficulty with is detecting 7- or 10-digit numbers that don't have any hyphens at all.
Aaron
@Aaron, I took a shot at modifying the Regex you gave to solve your problem. It's included in my answer, which I edited. Give it a try and see if it works.
Auguste
This looks really great. I just did some testing and it appears to be working very well. My only question: how do I implement this so that it works alongside my existing email address matching? Is there a way to do this that isn't too much work? Thanks again.
Aaron
You should be able to implement it similarly to the way the email Regex is already implemented; try modifying a copy of the `# regex = [email protected]` block. I'd take a shot, but I've never touched Python before. You simply want to search the file you open for matches to the phone number Regex, and write them to `result.txt`.
Auguste
Hey, what's the status on your project? Did you get it implemented?
Auguste
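
For anyone landing here later, the approach suggested in the comments (copy the email block and reuse it for phone numbers) might look roughly like the sketch below. It is not the original poster's final code. The function names `extract_contacts` and `write_results`, the `sample` text, and the slightly tidied email pattern are all assumptions made for the illustration; the phone pattern is the modified one from the answer above.

```python
import re

# Email pattern: a tidied variant of the one in the question (assumption).
EMAIL_RE = re.compile(r'\b[\w.]+@[\w.]+\.\w+\b')

# Phone pattern: the modified expression from the answer.
PHONE_RE = re.compile(
    r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}'
    r'|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}'
    r'|\d{3}[-\.\s]??\d{4})'
)

def extract_contacts(text):
    """Return (emails, phones) found in text, each in order of appearance."""
    emails = EMAIL_RE.findall(text)  # no groups, so whole matches come back
    phones = [m.group(0) for m in PHONE_RE.finditer(text)]
    return emails, phones

def write_results(path, emails, phones):
    """Write one match per line: emails first, then phone numbers."""
    with open(path, 'w') as f:
        for item in emails + phones:
            f.write(item + "\n")

# Hypothetical text standing in for the contents of file.txt.
sample = "Reach jane.doe@example.com at (555) 123-4567 or 5551234."
emails, phones = extract_contacts(sample)
print(emails)  # ['jane.doe@example.com']
print(phones)  # ['(555) 123-4567', '5551234']
```

In the original script, `sample` would be replaced by the text read from `file.txt`, and a call such as `write_results('result.txt', emails, phones)` would take the place of `writefile()`.
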