views:

275

answers:

2

I would like a regular expression, one that really works, that extracts email addresses from a String (using Java regular expressions).

+3  A: 

Here's a regular expression that really works. I spent an hour searching the web and testing different approaches, and most of them didn't work, even though Google ranked those pages at the top.

I want to share with you a working regular expression:

[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})

Here's the original link: http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
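As a minimal sketch of how the pattern above could be used for extraction rather than validation, the following applies it with `Matcher.find()` to pull every match out of a longer string. The class name `EmailExtractor`, the method `extract`, and the sample text are assumptions made for illustration; they are not from the linked article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the regex itself comes from the answer above.
final class EmailExtractor {
    // The pattern from the answer, written as a Java string literal (backslashes doubled).
    private static final Pattern EMAIL = Pattern.compile(
        "[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})");

    // Returns every substring of text that the pattern matches, in order of appearance.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {          // find() scans forward for the next match
            found.add(m.group());   // group() with no argument is the whole match
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample input is made up for this sketch.
        System.out.println(extract("Write to john.doe@example.com or admin@mail.example.org today."));
    }
}
```

Because `find()` accepts partial matches, an address whose local part contains a character outside the bracketed set is not rejected but truncated: for `user+tag@example.com` this sketch returns `tag@example.com`.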

EugeneP
Sorry, this is not right. It will fail for plus-addressing (http://en.wikipedia.org/wiki/E-mail_address#Sub-addressing), among other things (an example is [email protected]). Writing a correct regular expression for email addresses is /very/ hard (if not impossible). See also http://stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses/201378#201378
Matthew Flaschen
Not to mention ICANN's decision to allow non-Latin characters in email addresses: http://stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses/1931322#1931322
BalusC
Well, you're right, I didn't know that a plus sign could be part of an email address. It can easily be added between the square brackets. But I'm pretty sure 99.9% of people do not use it, and most email servers do not allow a plus sign as part of the email address. I absolutely agree that there may be situations where any regular expression will fail at email validation/extraction. Still, this one worked for me, and I've seen others that did not.
EugeneP
+1  A: 

Install this regex tester plugin into Eclipse, and you'll have a whale of a time testing regexes:
http://brosinski.com/regex/.

Points to note:
In the plugin, use a single backslash to escape a character. When you transcribe the regex into a Java/C# string literal, you have to double each backslash, because two levels of escaping apply: the first escapes the backslash for the Java/C# string literal, and the second is the regex's own character escape.
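The double escaping can be checked directly in Java. This is a small sketch; the class name `EscapeDemo` and the test strings are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing the two escape levels.
final class EscapeDemo {
    // In a regex tester you would type:  \.
    // In Java source the same regex is:  "\\."
    static final String ESCAPED_DOT = "\\.";

    public static void main(String[] args) {
        // At runtime the string holds two characters: a backslash and a dot.
        System.out.println(ESCAPED_DOT.length());

        // With the escape, the dot is literal and only matches '.'.
        System.out.println(Pattern.matches("a\\.b", "a.b"));
        System.out.println(Pattern.matches("a\\.b", "axb"));

        // Without the escape, the dot is a metacharacter matching any character.
        System.out.println(Pattern.matches("a.b", "axb"));
    }
}
```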

Surround the sections of the regex whose text you wish to capture with parentheses (round brackets). Then you can use the group functions in Java or C# regex to retrieve the values of those sections.

([_A-Za-z0-9-]+)(\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\.[A-Za-z0-9]+)

For example, using the above regex, the following string

abc.efg@asdf.cde

yields

start=0, end=16
Group(0) = abc.efg@asdf.cde
Group(1) = abc
Group(2) = .efg
Group(3) = asdf
Group(4) = .cde

Group 0 is always the entire text matched by the regex.

If you do not enclose any section in parentheses, you can still detect a match and read it as a whole through group 0, but you cannot capture the individual parts of the text.
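In Java, the group functions mentioned above are `Matcher.group(int)`, `Matcher.groupCount()`, `Matcher.start()` and `Matcher.end()`. The sketch below runs the four-group regex against an input assumed to be `abc.efg@asdf.cde`, which is consistent with the group values and offsets listed above. The class name `GroupDemo` and the method `describe` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the regex is the one shown above with backslashes doubled for Java.
final class GroupDemo {
    static final Pattern P = Pattern.compile(
        "([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)@([A-Za-z0-9]+)(\\.[A-Za-z0-9]+)");

    // Builds a report of the match offsets and every capture group.
    static String describe(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder();
        sb.append("start=").append(m.start()).append(", end=").append(m.end()).append('\n');
        // groupCount() excludes group 0, so the loop runs to groupCount() inclusive.
        for (int i = 0; i <= m.groupCount(); i++) {
            sb.append("Group(").append(i).append(") = ").append(m.group(i)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe("abc.efg@asdf.cde"));
    }
}
```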

It might be less confusing to create several smaller regexes rather than one long catch-all regex, since you can test them programmatically one by one and then decide which should be consolidated. This helps especially when you come across a new email pattern that you had never considered before.
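One way to test several regexes one by one is to keep them in a named map and report which ones accept a given candidate. This is only a sketch under assumptions: the class `PatternSet`, the names "basic" and "plus", and the sample inputs are hypothetical. The "basic" pattern is the one from the first answer, and "plus" is the same pattern with `+` added inside the square brackets, as suggested in the comments there.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical registry of candidate patterns, checked one at a time.
final class PatternSet {
    // LinkedHashMap keeps insertion order, so results are reported predictably.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("basic", Pattern.compile(
            "[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})"));
        // Same as "basic" but with '+' allowed in the local part.
        PATTERNS.put("plus", Pattern.compile(
            "[_A-Za-z0-9+-]+(\\.[_A-Za-z0-9+-]+)*@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})"));
    }

    // Returns the names of all patterns that match the whole candidate string.
    static List<String> matching(String candidate) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            if (e.getValue().matcher(candidate).matches()) {
                names.add(e.getKey());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(matching("abc.efg@asdf.cde"));
        System.out.println(matching("user+tag@example.com"));
    }
}
```

When a new address format turns up, it can be covered by adding one more entry and rerunning the same inputs, which shows whether the new pattern overlaps with the existing ones before any of them are merged.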

Blessed Geek
@h2g2java Talking about myself, I already use a similar plugin. And I appreciate your answer very much, cuz I also find that without such tools working with regular expressions can be a nightmare. I'm sure your answer will help many people to save their time.
EugeneP