RFC 1738 specifies the syntax for URLs, and mentions that:

URLs are written only with the graphic printable characters of the
US-ASCII coded character set. The octets 80-FF hexadecimal are not
used in US-ASCII, and the octets 00-1F and 7F hexadecimal represent
control characters; these must be encoded.

It does not, however, say which character set these encoded octets then represent.

RFC 2396 seems to try to improve on the situation, but it only says:

For original character sequences that contain non-ASCII characters, however, the situation is more difficult. Internet protocols that transmit octet sequences intended to represent character sequences are expected to provide some way of identifying the charset used, if there might be more than one [RFC2277]. However, there is currently no provision within the generic URI syntax to accomplish this identification. An individual URI scheme may require a single charset, define a default charset, or provide a way to indicate the charset used.

It is expected that a systematic treatment of character encoding within URI will be developed as a future modification of this specification.

Is there any unambiguous way for a client to determine which character set to use when interpreting encoded octets, or for a server to determine which character set a client used to encode them?

It looks to me like most servers default to UTF-8, but this seems to be a de facto choice more than a specified one.
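To illustrate the ambiguity, here is a minimal Python sketch using the standard `urllib.parse` module. The sample string `caf%C3%A9` is made up for illustration; the point is only that the same percent-encoded octets decode to different text depending on which charset the receiver assumes.

```python
from urllib.parse import unquote, unquote_to_bytes

# Hypothetical percent-encoded path segment (not from any real request).
encoded = "caf%C3%A9"

# Percent-decoding only yields octets; nothing in the URL says what they mean.
raw = unquote_to_bytes(encoded)

# The receiver has to pick a charset, and the result depends on that choice.
as_utf8 = unquote(encoded, encoding="utf-8")
as_latin1 = unquote(encoded, encoding="latin-1")

print(raw)        # the two raw octets 0xC3 0xA9 after "caf"
print(as_utf8)    # one accented character if UTF-8 is assumed
print(as_latin1)  # two separate characters if ISO-8859-1 is assumed
```

Neither result is "wrong" according to RFC 1738 or RFC 2396, which is exactly the problem.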

+4  A: 

As per your quote, URLs are ASCII. That's all.

URIs, OTOH, allow for bigger charsets; usually UTF-8, as you said yourself.

The point to remember is that URLs are a subset of URIs. Therefore, the real question is: which of these is what you write in a browser? I'd guess you can write a URI, and the browser should try its best to transform it into a URL (which is what HTTP/1.1 supports, AFAICR). For non-ASCII characters, that means hex codes, usually encoding UTF-8.
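As a rough sketch of that transformation (the path below is an invented example, and real browsers do more than this), Python's `urllib.parse.quote` shows what "hex codes, usually encoding UTF-8" looks like, and how the result changes if another charset is used:

```python
from urllib.parse import quote

# Hypothetical path containing one non-ASCII character.
path = "/wiki/Z\u00fcrich"

# quote() defaults to UTF-8: the character becomes two percent-encoded octets.
utf8_form = quote(path)

# With ISO-8859-1 the same character becomes a single octet instead.
latin1_form = quote(path, encoding="latin-1")

print(utf8_form)    # /wiki/Z%C3%BCrich
print(latin1_form)  # /wiki/Z%FCrich
```

Both outputs are pure ASCII and therefore valid URLs; only the convention tells the server which one to expect.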

Javier
A: 

I believe the specification you are looking for is RFC 3987, which describes IRIs - Internationalized Resource Identifiers.
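RFC 3987 settles the charset question for IRIs by mapping non-ASCII characters to URIs via UTF-8 and percent-encoding. The following is only a simplified sketch of that idea, not a conforming implementation: `iri_to_uri` is a hypothetical helper, it ignores userinfo, and it uses Python's built-in `idna` codec for the host name.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched; "%" is kept so existing escapes are not re-encoded.
_SAFE = "/%:@!$&'()*+,;=~"


def iri_to_uri(iri: str) -> str:
    """Simplified IRI-to-URI mapping (hypothetical helper, drops userinfo)."""
    parts = urlsplit(iri)

    # Host names go through IDNA rather than percent-encoding.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Everything else: encode as UTF-8, then percent-encode the octets.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://example.com/wiki/Z\u00fcrich"))
```

Because the mapping is fixed to UTF-8, a server that receives such a URI can decode the octets without guessing, provided the client really started from an IRI.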

Jim