I recently read somewhere that writing a regexp to match an email address, taking into account all the variations and possibilities of the standard, is extremely hard and significantly more complicated than one would initially assume.

Can anyone provide some insight as to why that is?

Are there any known and proven regexps that actually do this fully?

What are some good alternatives to using regexps for matching email addresses?

+32  A: 

For the formal e-mail spec, yes, it is technically impossible via Regex due to the recursion of things like comments (especially if you don't replace comments with whitespace first), and the various different formats (an e-mail address isn't always [email protected]). You can get close (with some massive and incomprehensible Regex patterns), but a far better way of checking an e-mail is to do the very familiar handshake:

  • they tell you their e-mail
  • you e-mail them a confirmation link with a Guid
  • when they click on the link you know that:

    1. the e-mail is correct
    2. it exists
    3. they own it

Far better than blindly accepting an e-mail address.
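
The handshake above can be sketched roughly as follows. This is only an illustration in Python, not anyone's production code: the `pending` and `confirmed` stores, the `send_mail` callback and the `example.invalid` URL are all made-up placeholders, and a real site would use a database, token expiry and a real mailer.

```python
import uuid

# In-memory stand-ins for database tables (hypothetical, for illustration only).
pending = {}       # token -> email address awaiting confirmation
confirmed = set()  # addresses whose owner clicked the link


def start_confirmation(email, send_mail):
    """Store a random token for this address and mail a link containing it."""
    token = uuid.uuid4().hex  # random 128-bit identifier, i.e. a GUID
    pending[token] = email
    # Hypothetical confirmation URL; the host and path are assumptions.
    link = "https://example.invalid/confirm?token=" + token
    send_mail(email, "Please confirm your address: " + link)
    return token


def confirm(token):
    """Called when the link is clicked; returns the confirmed address or None."""
    email = pending.pop(token, None)  # each token works only once
    if email is not None:
        confirmed.add(email)
    return email
```

No syntax check happens anywhere in this sketch: the address is only trusted once `confirm` has been called with a token that was actually mailed out.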

Marc Gravell
Good advice, if you're writing a website, doesn't work so well if you're writing an email server / client :-)
Johan
If you're writing an email client or server, then you shouldn't be fake-parsing the only thing you have to parse (pretty much).
Marcin
How do you email them a confirmation without blindly accepting their email address?
janm
@janm: the email server does the validation for you: If the message was delivered (and the link within clicked) the address was valid.
David Schmitt
If you have a trustworthy email server and you can get the email address to it reliably, great (e.g. qmail, postfix with Unix style exec(2)). If not, some care must still be taken, like with any data from an untrusted source.
janm
@Johan: replace "click on the link" with "reply to email"
Jason S
A: 

Can anyone provide some insight as to why that is?

Yes, it is an extremely complicated standard that allows lots of stuff that no one really uses today. :)

Are there any known and proven regexps that actually do this fully?

Here is one attempt to parse the whole standard fully...

http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

What are some good alternatives to using regexps for matching email addresses?

Using an existing framework for it in whatever language you are using I guess? Though those will probably use regexp internally. It is a complex string. Regexps are designed to parse complex strings, so that really is your best choice.

Edit: I should add that the regexp I linked to was just for fun. I do not endorse using a complex regexp like that - some people say that "if your regexp is more than one line, it is guaranteed to have a bug in it somewhere". I linked to it to illustrate how complex the standard is.

Lars Westergren
Well, no. Regexps are an easy-to-write-quickly way of parsing strings, whether or not complex. They are not designed to handle things that they literally cannot handle because it is mathematically beyond them, or indeed things that require insane, unmaintainable regexes.
Marcin
Is anything designed to handle things mathematically beyond them? :P
Lars Westergren
+5  A: 

There is a context free grammar in BNF that describes valid email addresses in RFC-2822. It is complex. For example:

" @ "@example.com

is a valid email address. I don't know of any regexps that do it fully; the examples usually given require comments to be stripped first. I wrote a recursive descent parser to do it fully once.
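
To illustrate the comment-stripping step mentioned above, here is a small Python sketch (not the recursive descent parser from this answer, and not a validator). It removes parenthesised comments, which may nest, while leaving quoted strings alone. The function name and the sample addresses are made up for illustration. The point is that it has to count nesting depth, which is what a classical regular expression cannot do.

```python
def strip_comments(text):
    """Remove parenthesised comments, which may nest, outside quoted strings."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a "quoted string" (only tracked outside comments)
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # A backslash escapes the next character; keep the pair
            # unless we are inside a comment.
            if depth == 0:
                out.append(text[i:i + 2])
            i += 2
            continue
        if in_quotes:
            out.append(ch)
            if ch == '"':
                in_quotes = False
        elif ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unbalanced ')'")
            depth -= 1
        elif depth == 0:
            out.append(ch)
            if ch == '"':
                in_quotes = True
        i += 1
    if depth != 0 or in_quotes:
        raise ValueError("unterminated comment or quoted string")
    return "".join(out)


print(strip_comments("john(a (nested) comment)@example.com"))  # john@example.com
print(strip_comments('" @ "@example.com'))                     # unchanged
```

Only after a pass like this do the usual regexps stand a chance, and they still have to cope with the quoted local part in the second example.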

janm
+1  A: 

Some flavours of regex can actually match nested brackets (e.g., Perl compatible ones). That said, I have seen a regex that claims to correctly match RFC 822 and it was two pages of text without any whitespace. Therefore, the best way to detect a valid email address is to send email to it and see if it works.

1800 INFORMATION
+1  A: 

Try this one:

"(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"

Have a look here for the details.

However, rather than implementing the RFC822 standard, maybe it would be better to look at it from another viewpoint. It doesn't really matter what the standard says if mail servers don't mirror the standard. So I would argue that it would be better to imitate what the most popular mail servers do when validating email addresses.

Mike Thompson
I posted the same link on a similar question: http://stackoverflow.com/questions/210945/what-would-be-a-globally-accepted-regular-expression-to-match-e-mail-addresses#211040 I found that it explained the situation well!
brass-kazoo
A: 

It's really hard because there are a lot of things that can be valid in an email address according to the Email Spec, RFC 2822. Things that you don't normally see, such as +, are perfectly valid characters for an email address... according to the spec.

There's an entire section devoted to email addresses at http://regexlib.com, which is a great resource. I'd suggest that you determine which criteria matter to you and find one that matches. Most people really don't need full support for all possibilities allowed by the spec.

Wayne
-1 for "Most people really don't need full support for all possibilities allowed by the spec."
David Schmitt
@David Schmitt: The addresses Abc\@[email protected], customer/[email protected] and !def!xyz%[email protected] are all valid. However, 99.99% of people won't run into these types of addresses in a production site.
Wayne
+11  A: 

There are a number of Perl modules (for example) that do this. Don't try to write your own regexp to do it. Look at:

Mail::VRFY will do syntax and network checks (does an SMTP server somewhere accept this address)

http://search.cpan.org/~jkister/Mail-VRFY-0.58/VRFY.pm

RFC::RFC822::Address - a recursive descent email address parser.

http://search.cpan.org/~abigail/RFC_RFC822_Address-1.5/Address.pm

Mail::RFC822::Address: regexp-based address validation, worth looking at just for the insane regexp

http://ex-parrot.com/~pdw/Mail-RFC822-Address.html

Similar tools exist for other languages. Insane regexp below...

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)
mmaibaum
I remember someone saying that regex is both stupid (auto generated), and *wrong*. Does anyone else remember that?
Simon Buchan
That wouldn't surprise me to be honest - that said all the attempts to validate an email via regexp to the actual standard that I've seen have been insane to some degree - I wouldn't even try and understand that one. A regexp a tenth the size of it probably means you shouldn't be using it ;)
mmaibaum
I would file this under "cryptography"
steffenj
Personally I'd rather use a simplified regex than that beast (even if it only handles 99.95% of real cases)... or none at all and do the handshake.
Marc Gravell
@Simon, this is correct. You need to preprocess the string to remove comments before you can even apply this regex, and RFC822 is incredibly obsolete; it's from 1982(!)
Porges
A: 

Quoting and various other rarely used but valid parts of the RFC make it hard. I don't know enough about this topic to comment definitively, other than "it's hard" - but fortunately other people have written about it at length.

As to a valid regex for it, the Perl Mail::RFC822::Address module contains a regular expression which will apparently work - but only if any comments have been replaced by whitespace already. (Comments in an email address? You see why it's harder than one might expect...)

Of course, the simplified regexes which abound elsewhere will validate almost every email address which is genuinely being used...

Jon Skeet
A: 

Many have tried, and many have come close. You may want to read the Wikipedia article, and some others.

Specifically, you'll want to remember that many websites and email servers have relaxed validation of email addresses, so essentially they don't implement the standard fully. It's good enough for email to work all the time though.

Johan
+1  A: 

Something interesting about Email regular expression

http://www.codinghorror.com/blog/archives/000214.html

Nikhil Kashyap
A: 

If you're just interested in matching common email patterns, you can have a look at some of the expressions here.

On Freund
+4  A: 

It's not all nonsense though as allowing characters such as '+' can be highly useful for users combating spam, e.g. [email protected] (instant disposable Gmail addresses).

Only when a site accepts it though.

Christopher Galpin
This is fairly common, not only with gmail; I've been doing it for about a decade (I use - rather than + because I prefer it and it's my server so I can, but + is normal).
Mark Baker
A: 

Adding to Wayne's answer, there is also a section on www.regular-expressions.info dedicated to email, with a few samples.

You can always question whether it's worth it or if in fact any less-than-100%-covering regexp only contributes to a false sense of security.

In the end, actually sending the email is what will provide the real final validation. (-you'll find out if your mailserver has bugs;-)

conny
+7  A: 

Validating e-mail addresses isn't really very helpful anyway. It will not catch common typos or made-up email addresses, since these tend to look syntactically like valid addresses.

If you want to be sure an address is valid, you have no choice but to send a confirmation mail.

If you just want to be sure that the user inputs something that looks like an email rather than just "asdf", then check for an @. More complex validation does not really provide any benefit.

(I know this doesn't answer your questions, but I think it's worth mentioning anyway)

JacquesB
I think it does answer the question.
bzlm
I also like to check that there is only 1 @ character and that it is not the first or last character. When I know that the email address is going to be a "typically" formatted email address (i.e. [email protected]), I also like to check for 1 or more characters after the @ character, followed by a . character ("dot"), followed by at least 1 more character.
Adam Porad
@Adam: If you go down that road you have to do it correctly. See eg. janm's explanation of how you can have more than one @ in a valid email address.
JacquesB
A: 

For completeness of this post: PHP also has a built-in function to validate e-mails.

In PHP, use the nice filter_var with the specific EMAIL validation type :)

No more insane email regexes in php :D

var_dump(filter_var('[email protected]', FILTER_VALIDATE_EMAIL));

http://www.php.net/filter_var

SchizoDuckie
FILTER_VALIDATE_EMAIL is snake oil.
bzlm
A: 

Just to add a regex that is less crazy than the one listed by @mmaibaum:

^[a-zA-Z]([.]?([a-zA-Z0-9_-]+)*)?@([a-zA-Z0-9\-_]+\.)+[a-zA-Z]{2,4}$

It is not bulletproof, and certainly does not cover the entire email spec, but it does do a decent job of covering most basic requirements. Even better, it's somewhat comprehensible, and can be edited.

Cribbed from a discussion at HouseOfFusion.com, a world-class ColdFusion resource.
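
For anyone who wants to see what the pattern above accepts before editing it, here is a quick way to try it out in Python. The pattern is copied unchanged; the helper name and the sample addresses are made up for this sketch.

```python
import re

# The pattern from the answer above, copied unchanged.
SIMPLE_EMAIL = re.compile(
    r"^[a-zA-Z]([.]?([a-zA-Z0-9_-]+)*)?@([a-zA-Z0-9\-_]+\.)+[a-zA-Z]{2,4}$"
)


def looks_like_email(address):
    """Return True if the address matches the simplified pattern."""
    return SIMPLE_EMAIL.match(address) is not None


# Made-up sample addresses, chosen to probe the pattern's limits.
for sample in [
    "alice@example.com",       # plain address
    "first.last@example.com",  # dot in the middle of the local part
    "user+tag@example.com",    # plus addressing
    "bob@example.museum",      # top-level domain longer than 4 letters
]:
    print(sample, looks_like_email(sample))
```

Of these four samples only the first one is accepted. The local part allows at most one dot, and only directly after the first letter. '+' is not in the character class, and the top-level domain is capped at four letters. Also note that the nested quantifier `([a-zA-Z0-9_-]+)*` can backtrack very badly in a backtracking engine on long input that does not match, so it is probably wise to cap the input length before applying it.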

Ben Doom
That regex doesn't even cover [email protected], let alone [email protected]. If that's someone's idea of a world-class ColdFusion resource, thank $DEITY I don't program in CF.
womble
As stated in my description, it was *not* supposed to be exhaustive. It was supposed to be (relatively) straightforward, and easy to modify.
Ben Doom
Also, are you really going to judge a language based on what a handful of its users came up with years ago to solve something that is no longer a problem in the language?
Ben Doom
+3  A: 

Whether or not to accept bizarre, uncommon email address formats depends, in my opinion, on what one wants to do with them.

If you're writing a mail server, you have to be very exact and excruciatingly correct in what you accept. The "insane" regex quoted above is therefore appropriate.

For the rest of us, though, we're mainly just interested in ensuring that something a user types in a web form looks reasonable and doesn't have some sort of sql injection or buffer overflow in it.

Frankly, does anyone really care about letting someone enter a 200-character email address with comments, newlines, quotes, spaces, parentheses, or other gibberish when signing up for a mailing list, newsletter, or web site? The proper response to such clowns is "Come back later when you have an address that looks like [email protected]".

The validation I do consists of ensuring that there is exactly one '@'; that there are no spaces, nulls or newlines; that the part to the right of the '@' has at least one dot (but not two dots in a row); and that there are no quotes, parentheses, commas, colons, exclamations, semicolons, or backslashes, all of which are more likely to be attempts at hackery than parts of an actual email address.
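
The checks just described could look something like this in Python. This is a sketch of the rules as listed, not the actual code behind them. The function name is made up, and treating "quotes" as both single and double quotes (and counting carriage return as a newline) is an assumption.

```python
# Characters rejected outright: space, NUL, newlines, quotes, parentheses,
# comma, colon, exclamation mark, semicolon and backslash.
# Assumption: "quotes" covers both ' and ".
FORBIDDEN = set(" \0\r\n\"'(),:!;\\")


def looks_reasonable(address):
    """Apply the deliberately strict sanity checks listed above."""
    if address.count("@") != 1:
        return False  # exactly one '@'
    if any(ch in FORBIDDEN for ch in address):
        return False  # no whitespace, nulls or suspicious punctuation
    domain = address.split("@")[1]
    if "." not in domain or ".." in domain:
        return False  # at least one dot after the '@', never two in a row
    return True
```

By design this rejects valid but exotic addresses such as `" @ "@example.com`, while something like `user+tag@example.com` still gets through.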

Yes, this means I'm rejecting valid addresses with which someone might try to register on my web sites - perhaps I "incorrectly" reject as many as 0.001% of real-world addresses! I can live with that.

Matt Hucke
+2  A: 

An easy and good way to check email addresses in Java is to use the EmailValidator of the Apache Commons Validator library.

I would always check an email-address in an input-form against something like this before sending an email - even if you only catch some typos. You probably don't want to write an automated scanner for "delivery failed" notification mails. :-)

hstoerr
+4  A: 

I've now collated test cases from Cal Henderson, Dave Child, Phil Haack, Doug Lovell and RFC 3696. 158 test addresses in all.

I ran all these tests against all the validators I could find. The comparison is here: http://www.dominicsayers.com/isemail

I'll try to keep this page up-to-date as people enhance their validators. Thanks to Cal, Dave and Phil for their help and co-operation in compiling these tests and constructive criticism of my own validator.

People should be aware of the errata against RFC 3696 in particular. Three of the canonical examples are in fact invalid addresses. And the maximum length of an address is 254 or 256 characters, not 320.

Dominic Sayers
A: 

If you're running on the .NET Framework, just try instantiating a MailAddress object and catching the FormatException if it blows up, or pulling out the Address if it succeeds. Without getting into any nonsense about the performance of catching exceptions (really, if this is just on a single Web form it is not going to make that much of a difference), the MailAddress class in the .NET Framework goes through a quite complete parsing process (it doesn't use a RegEx). Open up Reflector and search for MailAddress and MailBnfHelper.ReadMailAddress() to see all of the fancy stuff it does. Someone smarter than me spent a lot of time building that parser at Microsoft, and I'm going to use it when I actually send an e-mail to that address, so I might as well use it to validate the incoming address, too.

Nicholas Piasecki
A: 

This class for Java has a validator in it: http://www.leshazlewood.com/?p=23

This is written by the creator of Shiro (formerly Ki, formerly JSecurity).

DutrowLLC