Hi all, I have to extract all email addresses from some .txt documents. These emails may have these formats:

  1. [email protected]
  2. {a, b, c}@abc.edu
  3. some other formats including some @ signs.

I chose Ruby as my first language for writing this program, but I don't know how to write the regex. Could someone help me? Thank you!

+1  A: 

Have a look at this rather in-depth analysis:

The upshot is to use this regular expression:

/^([\w\!\#$\%\&\'\*\+\-\/\=\?\^\`{\|\}\~]+\.)*[\w\!\#$\%\&\'\*\+\-\/\=\?\^\`{\|\}\~]+@((((([a-z0-9]{1}[a-z0-9\-]{0,62}[a-z0-9]{1})|[a-z])\.)+[a-z]{2,6})|(\d{1,3}\.){3}\d{1,3}(\:\d{1,5})?)$/i
Jonathan
A: 

The better expression to use is the following:

/^[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+)*\.(aero|arpa|biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|pro|travel|mobi|[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?$/ig

The other versions can produce more false positives, since they are a little more permissive with domain name extensions.

A simpler version to learn from, which will fail in some cases, would be: ([a-zA-Z0-9\-_+]*@([a-zA-Z0-9\-_+].)?[a-zA-Z0-9\-_+].[a-zA-Z0-9]{2,6})

Aaron Harun
A: 

Depending on the nature of your .txt documents, you don't have to use one of the complicated regexes that attempt to validate email addresses. You're not trying to validate anything. You're just trying to grab what's already there. Generally speaking, a regex to grab what's already there can be much simpler than a regex that needs to validate input.
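To make the validate-versus-extract distinction concrete, here is a rough Ruby sketch. Both patterns are deliberately simplified illustrations, and the constant names are made up for this example. Note that Ruby regex literals do not accept a `g` flag; `String#scan` returns every match, which covers the same need.

```ruby
# Simplified, illustrative patterns (assumptions for this sketch only).
VALIDATE_RE = /\A\w+@[\w.-]+\z/   # \A and \z anchor to the whole string
EXTRACT_RE  = /\w+@[\w.-]+/       # unanchored: finds matches anywhere

line = "Ask bob@example.com about the report"

# Validation asks: is the ENTIRE string an address? Here it is not.
whole_line_is_address = line.match?(VALIDATE_RE)

# Extraction asks: which substrings look like addresses?
addresses_in_line = line.scan(EXTRACT_RE)

puts whole_line_is_address
puts addresses_in_line
```

An anchored validation pattern rejects the line as a whole, while the unanchored one picks the address out of the surrounding text.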

An important question is whether your .txt documents contain @ signs that are not part of an email address you want to extract.

This regex handles your first two requirements:

\w+@[\w.-]+|\{(?:\w+, *)+\w+\}@[\w.-]+
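Since the question asks for Ruby, here is a minimal sketch of using that pattern with `String#scan`. The helper names `extract_emails` and `expand_group` are hypothetical, and expanding `{a, b, c}@abc.edu` into one address per name is an assumption about what the asker wants to end up with.

```ruby
# The pattern above, copied verbatim into a Ruby regex literal.
EMAIL_RE = /\w+@[\w.-]+|\{(?:\w+, *)+\w+\}@[\w.-]+/

# Hypothetical helper: return every match found in a string.
# The pattern has no capturing groups, so scan yields plain strings.
def extract_emails(text)
  text.scan(EMAIL_RE)
end

# Hypothetical helper: turn "{a, b, c}@abc.edu" into one address per name.
# Plain addresses are returned unchanged, wrapped in an array.
def expand_group(match)
  return [match] unless match.start_with?("{")
  names, domain = match.split("}@", 2)
  names.delete_prefix("{").split(/, */).map { |name| "#{name}@#{domain}" }
end

sample = "Write to alice@example.org or {a, b, c}@abc.edu for details"
found = extract_emails(sample)
expanded = found.flat_map { |m| expand_group(m) }
puts expanded
```

Be aware that `[\w.-]+` also accepts a trailing dot, so an address that ends a sentence will come back with the full stop attached.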

Or if you want to allow any sequence of non-space characters containing an @ sign, plus your second requirement (which has spaces):

\S+@\S+|\{(?:\w+, *)+\w+\}@[\w.-]+
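A rough Ruby sketch of the looser pattern applied to a folder of .txt files follows. The helpers `tidy`, `loose_emails` and `emails_in_dir` are hypothetical names, and the punctuation cleanup is an added assumption rather than part of the pattern: `\S+` happily sweeps up brackets, commas and full stops next to an address.

```ruby
# The looser pattern above, copied verbatim.
LOOSE_EMAIL_RE = /\S+@\S+|\{(?:\w+, *)+\w+\}@[\w.-]+/

# Hypothetical cleanup: drop non-word characters that \S+ swept up at
# either end, keeping a leading "{" so the grouped form survives.
def tidy(match)
  match.gsub(/\A[^\w{]+|[^\w]+\z/, "")
end

# Hypothetical helper: all tidied, de-duplicated matches in one string.
def loose_emails(text)
  text.scan(LOOSE_EMAIL_RE).map { |m| tidy(m) }.uniq
end

# Hypothetical helper: scan every .txt file directly inside a directory.
def emails_in_dir(dir)
  paths = Dir.glob(File.join(dir, "*.txt")).sort
  paths.flat_map { |path| loose_emails(File.read(path)) }.uniq
end

text = "Mail (carol@example.net), dave@example.com. or {a, b, c}@abc.edu today"
raw = text.scan(LOOSE_EMAIL_RE)   # includes the stray "(", ")," and "."
clean = loose_emails(text)        # same matches with the edges trimmed
puts clean
```

Whether the cleanup step is appropriate depends on the documents; if they contain @ signs that are not part of addresses, this pattern will report those tokens too.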
Jan Goyvaerts