This question concerns the characters in the query string portion of the URL, i.e. those that appear after the `?` character.
Per Wikipedia, certain characters are left as is and others are encoded (usually with a `%` escape sequence).
I've been trying to track this down to actual specifications, so that I understand the justification behind every bullet point in that Wikipedia page.
Contradiction Example 1:
The HTML specification says to encode space as `+` and defers the rest to RFC1738. However, this RFC says that `~` is unsafe and furthermore that "[a]ll unsafe characters must always be encoded within the URL". This seems to contradict Wikipedia, which leaves `~` unencoded.
In practice, IE8 encodes `~` in the query strings it generates, while FF3 leaves it as is.
Contradiction Example 2:
Wikipedia states that all characters that it does not mention must be encoded, and `!` is not mentioned in Wikipedia. But RFC1738 states that `!` is a "special" character and "may be used unencoded". This seems to contradict Wikipedia, which says that it must be encoded.
In practice, IE8 encodes `!` in the query strings it generates, while FF3 leaves it as is.
I understand that the moral of this is probably going to be to encode those characters that are in doubt between Wikipedia and the specifications, perhaps even going as far as encoding everything that is not [A-Za-z0-9]. I would just like to know the actual standards on this.
Conclusions
The algorithm described on Wikipedia encodes precisely those characters which are not RFC3986 unreserved characters. That is, it encodes all characters other than alphanumerics and `-._~`. As a special case, space is encoded as `+` rather than the `%20` that RFC3986 percent-encoding would produce.
Some applications use an older RFC. For comparison, the RFC2396 unreserved characters are alphanumerics and `!'()*-._~`.
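To make the difference between the two unreserved sets concrete, here is a minimal Python sketch. The function name `encode_query_value` and the choice to percent-encode the UTF-8 bytes of the input are my own assumptions for illustration; neither comes from Wikipedia or the RFCs, which only define the character sets.

```python
import string

ALNUM = string.ascii_letters + string.digits

# Unreserved sets as listed above.
RFC3986_UNRESERVED = frozenset(ALNUM + "-._~")
RFC2396_UNRESERVED = frozenset(ALNUM + "!'()*-._~")


def encode_query_value(text, unreserved):
    """Percent-encode every byte not in `unreserved`; space becomes '+'.

    Hypothetical helper: assumes the text is serialized as UTF-8 first.
    """
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch == " ":
            out.append("+")                    # form-style special case
        elif ch in unreserved:
            out.append(ch)                     # left as is
        else:
            out.append("%{:02X}".format(byte)) # %XX escape
    return "".join(out)


sample = "a b~!*'()"
print(encode_query_value(sample, RFC3986_UNRESERVED))  # a+b~%21%2A%27%28%29
print(encode_query_value(sample, RFC2396_UNRESERVED))  # a+b~!*'()
```

With the same input, the RFC2396 set leaves `!*'()` alone while the RFC3986 set escapes them, which is the kind of discrepancy described in Contradiction Example 2.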
For comparison, the HTML5 working draft algorithm encodes all characters other than alphanumerics and `*-._`. The special case encoding for space remains `+`. The notable differences from RFC3986 are that `*` is not encoded and `~` is encoded. (Technically, leaving `*` unencoded is compatible with RFC3986 even though `*` is in `reserved`, because it is in `sub-delims`, which are allowed in the `query` production.)
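The comparison above can be checked mechanically with set arithmetic. This is only a sketch: the variable names are mine, and the `SUB_DELIMS` string is the `sub-delims` rule as I read it in RFC3986.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)

# Characters each algorithm leaves unencoded (space is handled separately as '+').
RFC3986_UNRESERVED = ALNUM | set("-._~")
HTML5_UNENCODED = ALNUM | set("*-._")

# sub-delims from RFC3986; these are permitted in the query production.
SUB_DELIMS = set("!$&'()*+,;=")

only_html5 = HTML5_UNENCODED - RFC3986_UNRESERVED    # left alone by HTML5 only
only_rfc3986 = RFC3986_UNRESERVED - HTML5_UNENCODED  # left alone by RFC3986 only

print(sorted(only_html5))        # ['*']
print(sorted(only_rfc3986))      # ['~']
print(only_html5 <= SUB_DELIMS)  # True: '*' is still legal in a query
```

The two algorithms therefore disagree only on `*` and `~`, and encoding `~` is harmless because a percent-encoded unreserved character is equivalent to the literal one.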