I tried to find this in the relevant RFC, IETF RFC 3986, but couldn't figure it out.

Do URIs for HTTP allow Unicode, or non-ASCII of any kind?

Can you please cite the section and the RFC that supports your answer?

NB: For those who might think this is not programming related - it is. It's related to an ISAPI filter I'm building.


Addendum

I've read section 2.5 of RFC 3986. But RFC 2616, which I believe is the current HTTP protocol, predates 3986, and for that reason I'd suppose it cannot be compliant with 3986. Furthermore, even if or when the HTTP RFC is updated, there still will be the issue of rationalization - in other words, does an HTTP URI support ALL of the RFC3986 provisos, including whatever is appropriate to include non US-ASCII characters?

+3  A: 

Here is an example: ☃.net.

In terms of the relevant section of RFC 3986, I think you are looking at 2.5.

EDIT:

Apparently Stack Overflow doesn't detect this as a proper URL. You'll have to copy and paste it into your browser.

mlsteeves
I'm not clear on your answer. Are HTTP URIs with non US-ASCII characters supported, or not? Providing one example isn't "support". Also, I'm clear on RFC3986. I mean I read section 2.5. But RFC 2616, which I believe is the current HTTP protocol, predates 3986, and for that reason I'd suppose it cannot be compliant with 3986. Furthermore, even if/when the HTTP RFC is updated, there still will be the issue of rationalization - in other words, does an HTTP URI support *ALL* of the RFC3986 provisos, including whatever is appropriate to include non US-ASCII characters?
Cheeso
So for me, your response here provides information, but not an actual *answer.* Also - just as a side note, I couldn't get that URL to work, in any browser, no matter what I did.
Cheeso
The HTTP RFC *is* being updated, and it will reference RFC 3986, see the IETF HTTPbis WG's home page.
Julian Reschke
A: 

It used to be that non english characters were not allowed in DNS names or in URLs/URIs. There was a hack to allow them by using % encoding in the URI. However, many countries such as Russia and China are starting to implement DNS using non-Latin characters. Here is a reference to one of these standards

Vlad
“non english” → “non-ASCII”. There are many English-language characters that were also not valid in domain names.
bignose
So my takeaway from this is that... the standards are #1, still evolving, and #2, still being adopted. In other words, support for non-US-ASCII characters in HTTP URIs isn't solid yet. Would that be accurate?
Cheeso
No, that's not accurate. URIs do not contain non-ASCII characters. By definition. Ever. IRIs (RFC 3987) do. You can map IRIs to URIs. HTTP only uses URIs on the wire.
Julian Reschke
+2  A: 

http://en.wikipedia.org/wiki/Internationalized_domain_name

dan04
very helpful. . .. .
Cheeso
A: 

Many browsers now support URIs with Unicode characters (I've implemented them on a website I've built called blogvani.com), and Google duly scans them and keeps them intact. I don't think that works on top-level domains though, at least not with the registrar and not directly.

For top-level domains if you have a domain registered in Unicode (for example people can register domains in Hindi), it will be converted to a corresponding code in ASCII (something that may go like jdhfks3243-32434.com)...

It is quite funny to see how this is routed and to realize that you're not actually going to a Unicode domain even though it seems like that.
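To make the Unicode-to-ASCII conversion described above concrete, here is a minimal sketch using Python's built-in `idna` codec. The codec follows the older IDNA 2003 rules, and the host name `bücher.example` and the helper names are assumptions chosen for illustration, not anything a registrar prescribes.

```python
# Sketch: convert a Unicode host name to the ASCII form that DNS actually sees.
# Assumption: Python's built-in "idna" codec (IDNA 2003) is an acceptable stand-in
# for whatever conversion a given browser or registrar performs.

def to_ascii_host(host: str) -> str:
    """Return the ASCII ("xn--" prefixed, Punycode) form of a host name."""
    return host.encode("idna").decode("ascii")


def to_unicode_host(host: str) -> str:
    """Return the Unicode form of an ASCII-encoded host name."""
    return host.encode("ascii").decode("idna")


if __name__ == "__main__":
    ascii_host = to_ascii_host("bücher.example")
    print(ascii_host)                   # xn--bcher-kva.example
    print(to_unicode_host(ascii_host))  # bücher.example
```

Labels that are already plain ASCII pass through unchanged, so only the internationalized labels pick up the `xn--` prefix.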

Cyril Gupta
+1  A: 

No, they are not allowed. Just check the ABNF in RFC 3986.
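As a rough illustration of what the ABNF implies, the sketch below checks a string against the character repertoire that RFC 3986 defines (unreserved, gen-delims, sub-delims, and percent-escapes). It is only a character-level check under that assumption, not a full parser for the grammar, and the function name is made up for this example.

```python
import re
import string

# Character classes from the RFC 3986 ABNF (Section 2.2 and 2.3).
UNRESERVED = string.ascii_letters + string.digits + "-._~"
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
ALLOWED = set(UNRESERVED + GEN_DELIMS + SUB_DELIMS)

# pct-encoded = "%" HEXDIG HEXDIG
PCT_ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")


def uses_only_uri_characters(candidate: str) -> bool:
    """Return True if every character could appear somewhere in an RFC 3986 URI.

    Hypothetical helper: it does not verify that characters appear in the
    right component, only that none fall outside the allowed repertoire.
    """
    # Strip well-formed percent-escapes; any "%" left over is malformed.
    remainder = PCT_ENCODED.sub("", candidate)
    return all(ch in ALLOWED for ch in remainder)
```

Every allowed character is US-ASCII, so any string containing a character such as `é` or `☃` fails the check until it is percent-encoded or, for host names, converted to its ASCII form.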

Julian Reschke
and from your comment on the other answer: *URIs do not contain non-ASCII characters. By definition. Ever. IRIs (RFC 3987) do. You can map IRIs to URIs. HTTP only uses URIs on the wire.*
Cheeso
A: 

RFC 3986 is being replaced with RFC 3987, which fully supports Unicode, and provides mapping rules to/from RFC 3986 style URIs.
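A minimal sketch of the IRI-to-URI direction of that mapping, using only the Python standard library. It assumes the host is converted with the built-in `idna` codec and that the other components are UTF-8 encoded and then percent-escaped. The function name `iri_to_uri` is hypothetical, and userinfo and other edge cases from RFC 3987 Section 3.1 are deliberately ignored.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched in each component; "%" is kept so that escapes
# already present in the input are not encoded a second time.
PATH_SAFE = "/%:@!$&'()*+,;="
QUERY_SAFE = "/?%:@!$&'()*+,;="


def iri_to_uri(iri: str) -> str:
    """Map an IRI to an ASCII-only URI (simplified sketch, no userinfo)."""
    parts = urlsplit(iri)

    # Host: convert internationalized labels to their ASCII ("xn--") form.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Other components: quote() encodes non-ASCII characters as UTF-8 bytes
    # and writes each byte as a %XX escape.
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    fragment = quote(parts.fragment, safe=QUERY_SAFE)

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://bücher.example/straße?q=ü"))
    # http://xn--bcher-kva.example/stra%C3%9Fe?q=%C3%BC
```

The result contains only US-ASCII characters, which is the form that is sent on the wire in an HTTP request.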

Remy Lebeau - TeamB
RFC 3987 (IRI) is not a replacement of RFC 3986 (URI). Better think of it as something layered on top.
Julian Reschke
Not layered on top of, but defined to the side of it. IRIs mirror the structure of URIs, but are not based on it. IRI is a stand-alone scheme, with Section 3 defining how to move between the two schemes when needed. I said it was a replacement because many systems that previously relied on URIs have been updated to rely on IRIs instead.
Remy Lebeau - TeamB