Does the HTTP standard, or some other standard, define which encoding should be used for special characters before they are encoded in a URL as %XX sequences? If not, is there a way to specify which encoding is used? It seems that most browsers send the data in UTF-8.

A: 

As far as I'm aware, there is no way to define it, though I've always assumed that it is ASCII, since that is what DNS is (currently, though localised DNS is coming, with all the problems that entails).

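Note: UTF-8 is "ASCII compatible": text that contains only ASCII characters is byte-for-byte identical in both encodings, and the two only differ once you use extended characters. This probably plays some small part in the reasoning behind why some browsers might send their GET data UTF-8 encoded.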

EDIT: From your comment, it seems like you don't know how the % encoding works at all, so here goes.

Given the query string "?foo=Hello World!", the "Hello World!" part needs URL encoding. Each 'special' character is replaced by its ASCII value written as two hex digits, prefixed by a '%'. So the above string would convert to "?foo=Hello%20World%21".
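For anyone who wants to check this, here is a minimal Python sketch using the standard library's `urllib.parse.quote`. The variable names are just for illustration and are not from the post; `safe=""` is passed so that no reserved character is left unescaped.

```python
from urllib.parse import quote

# The value from the example query string "?foo=Hello World!"
query_value = "Hello World!"

# safe="" means every character outside letters, digits and "_.-~" is escaped.
encoded = quote(query_value, safe="")

print("?foo=" + encoded)  # prints ?foo=Hello%20World%21
```

The space (ASCII 0x20) becomes `%20` and the exclamation mark (ASCII 0x21) becomes `%21`.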

Matthew Scharley
I meant special characters in request parameters like in http://foo/page.php?name=%12%34foo.
JtR
I think ISO-8859 is also compatible with ASCII as long as you don't use anything missing from ASCII. My Firefox at least seems to send iso-8859-1 as the default Accept-Charset value in requests. After changing the default encoding in about:config it still sends GET requests in UTF-8.
JtR
`Accept-Charset` only affects the encoding of the returned page, not that of the request itself. And I was referring to every character in the GET query, not just the hostname or some other part.
Matthew Scharley
How did you come to the conclusion that I don't know how URI escaping works?
JtR
A: 

Per RFC 2616,

   CHAR           = <any US-ASCII character (octets 0 - 127)>

and

   token          = 1*<any CHAR except CTLs or separators>
   separators     = "(" | ")" | "<" | ">" | "@"
                  | "," | ";" | ":" | "\" | <">
                  | "/" | "[" | "]" | "?" | "="
                  | "{" | "}" | SP | HT

and URIs are tokens with various specific separators. So, in theory, nothing but US-ASCII should be there. (In practice, since the ISO-8859-1 extension to US-ASCII is used in many other spots in the HTTP specs, it's not unusual to find HTTP implementations which support ISO-8859-1 rather than just US-ASCII, but strictly speaking that's not standards-compliant HTTP).

Alex Martelli
+3  A: 

Does the HTTP standard, or some other standard, define which encoding should be used for special characters before they are encoded in a URL as %XX sequences?

The HTTP standard, no. But another standard, IRI, can come into play.

URIs are explicitly (once %-decoded) byte sequences. What Unicode characters those bytes map onto is not specified by the URI standard or the HTTP standard for http:-scheme URIs.
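A small Python sketch of what "byte sequences" means in practice, using `urllib.parse.unquote_to_bytes` from the standard library. The sample escape sequence and the variable names are my own choice for illustration: the same decoded bytes turn into different characters depending on which charset the receiver assumes.

```python
from urllib.parse import unquote_to_bytes

# %-decoding yields raw bytes, not characters.
raw = unquote_to_bytes("%C3%A9")

# The receiver has to pick a charset; the URI itself does not say which.
as_utf8 = raw.decode("utf-8")         # one character: é
as_latin1 = raw.decode("iso-8859-1")  # two characters: Ã©

print(raw, as_utf8, as_latin1)
```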

Specifically for query parameters: web browsers will use the encoding of the originating page to make a form submission GET URL, so if you have a page in ISO-8859-1 and you put ‘é’ in a search box you'll get ‘?search=%E9’, but if you do the same in a page encoded as UTF-8 you'll get ‘?search=%C3%A9’. If you don't serve your form page with any particular charset the browser will guess, which you don't want, as it makes it impossible to know what encoding the submission is going to arrive in.
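The two outcomes can be reproduced with Python's `urllib.parse.urlencode`, which takes an `encoding` argument. This is only a sketch of the resulting query strings, not of what any particular browser does internally; the `form` dict is made up for the example.

```python
from urllib.parse import urlencode

# A hypothetical form with a single text field.
form = {"search": "é"}

# Same character, two page encodings, two different query strings.
latin1_query = urlencode(form, encoding="iso-8859-1")
utf8_query = urlencode(form, encoding="utf-8")

print("?" + latin1_query)  # prints ?search=%E9
print("?" + utf8_query)    # prints ?search=%C3%A9
```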

For the other parts of a URL, a browser won't generate them itself, but if you supply it with non-ASCII characters in links it will usually encode them as UTF-8. This is not reliable as it depends on browser and locale settings, so it's best not to use this at the moment.

The standard that properly allows non-ASCII characters in links is IRI. IRI converts to URI by UTF-8-%-encoding most of the URL, but the hostname is converted using Punycode instead. For compatibility it is best not to rely on browsers understanding IRIs in links yet. Instead, UTF-8-then-%-encode your path and parameter characters yourself. They will still appear as the right characters in the address bar in modern browsers; unfortunately IE won't display the decoded-character IRI form in all cases, depending on language settings.

The Wikipedia IRI for the Greek capital gamma character is:

http://en.wikipedia.org/wiki/Γ

Encoded into a URI, it is:

http://en.wikipedia.org/wiki/%CE%93
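A rough Python sketch of that conversion, standard library only. The function name `iri_to_uri` is my own, and this is a simplification rather than a full implementation of the IRI spec: it ignores userinfo, and it relies on Python's built-in `idna` codec for the hostname.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Convert an IRI to a URI (simplified; userinfo is dropped)."""
    parts = urlsplit(iri)

    # Hostname: Punycode via the built-in "idna" codec.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host + (f":{parts.port}" if parts.port else "")

    # Everything else: UTF-8 then %-encode. "%" is kept safe so that
    # escapes already present are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://en.wikipedia.org/wiki/Γ"))
# prints http://en.wikipedia.org/wiki/%CE%93
```

`quote` uses UTF-8 by default, which is why no encoding argument appears in the path, query and fragment calls.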
bobince
Where did you find out that the browser sends data in the encoding in which it received the form? My Firefox and Chrome really do seem to work that way when I change the content charset information.
JtR
It's just one of those behaviours that has always been followed, back as far as early Netscape. According to the specs the submission encoding should be controlled by `accept-charset` and communicated to the server in multipart form-data sub-headers, but in practice IE gets `accept-charset` dangerously wrong and no browser sends form-data sub-headers, so we are stuck with this situation of relying on the form encoding. Oh well, one day everyone will just use UTF-8 and everything will just work. One century...
bobince