I'm not asking about full email validation.

I just want to know which characters are allowed in the user-name and server parts of an email address. This may be oversimplified, and maybe email addresses can take other forms, but I don't care. I'm asking only about this simple form: user-name@server (e.g. [email protected]) and the allowed characters in both parts.

I know that a-z, 0-9, _, -, . (dot) can be used, but are there more allowed characters? Maybe +, $?

Hope I made my question clear ;-).

Update: if some characters are allowed, but only at specific positions (as the dot is in the user-name), then such a character simply counts as allowed. I don't want full email address validation, just a simple list (set) of all characters allowed in the user-name and server parts. Please don't bother with additional constraints on the positions of specific characters. If a character is allowed under some/any/none conditions then it is in the list of allowed chars.

Update2: please give as simple an answer as possible - just describe the available characters, do not just give a link to the specs.

Update3: Here is my motivation for asking exactly this question... My question may seem stupid, as I ask for very simplified conditions for valid email addresses. But with a proper answer and a very simple validation that only checks whether allowed characters were provided (or not), I will accept all valid email addresses (and many invalid ones too). I think that is more user friendly than the opposite (implementing a sophisticated validator that rejects some proper email addresses). Consider the + that Dan Herbert and laura are talking about in the comments. This is just an example of an overly constrained implementation of email address validation - and that is annoying!

Second reason is just curiosity - what are the allowed characters taken apart from other validation constraints?

+18  A: 

See RFC 5322 and, to a lesser extent, RFC 5321.

RFC 822 also covers email addresses, but it deals mostly with their structure:

 addr-spec   =  local-part "@" domain        ; global address     
 local-part  =  word *("." word)             ; uninterpreted
                                             ; case-preserved

 domain      =  sub-domain *("." sub-domain)     
 sub-domain  =  domain-ref / domain-literal     
 domain-ref  =  atom                         ; symbolic reference

And as usual, Wikipedia has a decent article on email addresses:

The local-part of the e-mail address may use any of these ASCII characters:

  • Uppercase and lowercase English letters (a-z, A-Z)
  • Digits 0 to 9
  • Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
  • Character . (dot, period, full stop) provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively.
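If, as the question asks, only the character set matters and the position rules for the dot are ignored, the list above can be turned into a very small check. This is a minimal sketch, not a full validator; the names `LOCAL_PART_CHARS` and `local_part_uses_allowed_chars` are made up for illustration.

```python
import string

# Characters taken from the list above: letters, digits,
# the listed special characters, and the dot.
LOCAL_PART_CHARS = frozenset(
    string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~" + "."
)


def local_part_uses_allowed_chars(local_part: str) -> bool:
    """Return True if the (unquoted) local-part is non-empty and uses only
    characters from the list. Dot position rules are deliberately ignored."""
    return len(local_part) > 0 and all(ch in LOCAL_PART_CHARS for ch in local_part)
```

Because positions are not checked, something like `.a..b.` passes, which is the "accept many invalid addresses too" trade-off described in the question.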

For validation, see this.

The domain part is defined as follows:

The Internet standards (Request for Comments) for protocols mandate that component hostname labels may contain only the ASCII letters a through z (in a case-insensitive manner), the digits 0 through 9, and the hyphen (-). The original specification of hostnames in RFC 952, mandated that labels could not start with a digit or with a hyphen, and must not end with a hyphen. However, a subsequent specification (RFC 1123) permitted hostname labels to start with digits. No other symbols, punctuation characters, or blank spaces are permitted.
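The hostname rules quoted above can be sketched the same way. This is only an illustration of the character rules in that paragraph (letters, digits, hyphen, no leading or trailing hyphen in a label); it assumes a plain dot-separated ASCII hostname, does not check label or name lengths, and does not handle IP address literals. The function names are hypothetical.

```python
import string

# Letters, digits and hyphen, as described in the quoted paragraph.
LABEL_CHARS = frozenset(string.ascii_letters + string.digits + "-")


def label_is_valid(label: str) -> bool:
    """Check one dot-separated label: allowed characters only, and no
    hyphen at the start or end. A leading digit is fine (RFC 1123)."""
    return (
        len(label) > 0
        and all(ch in LABEL_CHARS for ch in label)
        and not label.startswith("-")
        and not label.endswith("-")
    )


def hostname_is_valid(hostname: str) -> bool:
    """Check every label of a dot-separated hostname."""
    return all(label_is_valid(label) for label in hostname.split("."))
```

If only the raw character set is wanted, without the hyphen position rule, the answer for the domain part is simply the members of `LABEL_CHARS` plus the dot that separates labels.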

Anton Gogolev
please give just simple answer - list of possible characters - it's all i'm asking for. i don't have free time to spend on reading specs.
WildWezyr
But we do?... o.O
Filip Ekberg
@WildWezyr, It's not that simple. Email addresses have a lot of rules for what is allowed. It's simpler to refer to the spec than to list out all of them. If you want the complete Regex, check here to get an idea of why it's not so simple: http://www.regular-expressions.info/email.html
Dan Herbert
@Filip Ekberg: I think someone has already studied these docs and can share the knowledge. Isn't SO all about sharing knowledge ;-) ?
WildWezyr
@Dan Herbert: I don't want complete regex, just list of allowed characters... Simple straight question, looking for simple answer too ;-).
WildWezyr
there is no simple list, just because you want something simple doesn't mean it will be so. some characters can only be in certain locations and not in others. you can't have what you want all the time.
fuzzy lollipop
@WildWezyr Well, the full-stop character is allowed in the local-part. But not at the start or end. Or with another full-stop. So the answer IS NOT as simple as just a list of allowed characters, there are rules as to how those characters may be used - `[email protected]` is not a valid email address, but `[email protected]` is, even though both use the same characters.
Mark Pim
Also, remember that with internationalized domain names coming in, the list of allowed characters will explode.
Chinmay Kanchi
Please provide second part of answer (in form similar to "The domain of the e-mail address may use any of these characters: ...") and I will accept your answer.
WildWezyr
+1  A: 

You can start from the Wikipedia article:

  • Uppercase and lowercase English letters (a-z, A-Z)
  • Digits 0 to 9
  • Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
  • Character . (dot, period, full stop) provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively.
Vladimir
+1  A: 

Wikipedia has a good article on this and the official spec is here. From Wikipedia:

The local-part of the e-mail address may use any of these ASCII characters:

  • Uppercase and lowercase English letters (a-z, A-Z)
  • Digits 0 to 9
  • Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
  • Character . (dot, period, full stop) provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively.

Additionally, quoted-strings (ie: "John Doe"@example.com) are permitted, thus allowing characters that would otherwise be prohibited, however they do not appear in common practice. RFC 5321 also warns that "a host that expects to receive mail SHOULD avoid defining mailboxes where the Local-part requires (or uses) the Quoted-string form".

Mike Weller
Darn, you were a couple of seconds ahead of me. ;) (I deleted my answer as it was identical to yours)
Stefan
ok, great. now - how about the server part of the email address?
WildWezyr
@WildWezyr Valid hostnames, which could be an IP address, an FQDN, or something resolvable to a local network host.
JensenDied
+1  A: 

Hi there!

Check out atext in RFC 5322; it might be what you are looking for.

Anders
+1  A: 

Watch out! There is a bunch of knowledge rot in this thread (stuff that used to be true and now isn't).

To avoid false-positive rejections of actual email addresses in the current and future world, and from anywhere in the world, you need to know at least the high-level concept of RFC 3490, "Internationalizing Domain Names in Applications (IDNA)". I know folks in US and A often aren't up on this, but it's already in widespread and rapidly increasing use around the world (mainly the non-English dominated parts).

The gist is that you can now use addresses like mason@日本.com and wildwezyr@fahrvergnügen.net. No, this isn't yet compatible with everything out there (as many have lamented above, even simple qmail-style +ident addresses are often wrongly rejected). But there is an RFC, there's a spec, it's now backed by the IETF and ICANN, and--more importantly--there's a large and growing number of implementations supporting this improvement that are currently in service.

I didn't know much about this development myself until I moved back to Japan and started seeing email addresses like hei@やる.ca and Amazon URLs like this:

www.amazon.co.jp/エレクトロニクス-デジタルカメラ-ポータブルオーディオ/b/ref=topnav_storetab_e?ie=UTF8&node=3210981

(And no, ha ha, Stack Overflow couldn't deal with that link. But paste it into a modern Chrome or Safari and try it.)

I know you don't want links to specs, but if you rely solely on the outdated knowledge of hackers on Internet forums, your email validator will end up rejecting email addresses that non-English users increasingly expect to work. For those users, such validation will be just as annoying as the commonplace brain-dead form that we all hate, the one that can't handle a + or a three-part domain name or whatever.

So I'm not saying it's not a hassle, but the full list of characters "allowed under some/any/none conditions" is (nearly) all characters in all languages. If you want to "accept all valid email addresses (and many invalid too)" then you have to take IDN into account, which basically makes a character-based approach useless (sorry), unless you first convert the internationalized email addresses to Punycode.

After doing that you can follow (most of) the advice above.
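As a rough sketch of that conversion step: Python's built-in `idna` codec implements the RFC 3490 rules mentioned above, so the domain part can be converted to its ASCII (Punycode) form before any character check. The helper names here are hypothetical, and the sketch assumes the address contains an `@` and that only the domain part needs converting.

```python
def domain_to_ascii(domain: str) -> str:
    """Convert a possibly internationalized domain name to its ASCII form
    using Python's built-in RFC 3490 (IDNA) codec. Pure-ASCII domains are
    returned unchanged."""
    return domain.encode("idna").decode("ascii")


def address_to_ascii(address: str) -> str:
    """Convert only the domain part of user@domain; the local part is kept
    as typed. Raises ValueError if there is no '@' in the address."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("address has no '@'")
    return f"{local}@{domain_to_ascii(domain)}"
```

For example, `address_to_ascii("mason@日本.com")` yields an address whose domain starts with `xn--`, while an ordinary ASCII address comes back unmodified, so it is safe to run this on every input before validating.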

Mason
Are you sure that these extra characters are sent to and handled by servers? As far as I know, internationalized domain names are handled by browsers (protocol clients, not servers).
WildWezyr
Right; behind the scenes, the domain names are still just ASCII. But if your web app or form accepts user-entered input, then it needs to perform the same job that the web browser or mail client does when the user inputs an IDN hostname: convert the user input into DNS-compatible form. Then validate. Otherwise, these internationalized email addresses will not pass your validation. (Converters like the one I linked to only modify the non-ASCII characters they are given, so it is safe to use them on non-internationalized email addresses; those are just returned unmodified.)
Mason