We are designing a URL system that will specify application sections as words separated by slashes. Specifically, this is in GWT, so the relevant parts of the URL will be in the hash (which will be interpreted by a controller layer on the client-side):

http://site/gwturl#section1/section2

Some sections may need additional attributes, which we'd like to specify with a :, so that the section parts of the URL are unambiguous. The code would split first on /, then on :, like this:

http://site/gwturl#user:45/comments
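For illustration, the split described above (first on /, then on :) could look roughly like the following plain-Java sketch. The class names Section and HashSections are made up for this example, and nothing here is GWT-specific; it only assumes the fragment has already been read from the URL without the leading #.

```java
import java.util.ArrayList;
import java.util.List;

/** One section of the hash, e.g. "user:45" gives name "user" and attribute "45". */
final class Section {
    final String name;
    final String attribute; // null when the section has no ":" part

    Section(String name, String attribute) {
        this.name = name;
        this.attribute = attribute;
    }
}

/** Hypothetical helper, not part of GWT. */
final class HashSections {
    /** Splits the fragment on "/", then splits each piece on its first ":". */
    static List<Section> parse(String fragment) {
        List<Section> sections = new ArrayList<>();
        for (String piece : fragment.split("/")) {
            if (piece.isEmpty()) {
                continue; // tolerate empty or doubled slashes
            }
            int colon = piece.indexOf(':');
            if (colon < 0) {
                sections.add(new Section(piece, null));
            } else {
                sections.add(new Section(piece.substring(0, colon),
                                         piece.substring(colon + 1)));
            }
        }
        return sections;
    }
}
```

With this sketch, HashSections.parse("user:45/comments") yields two sections: user with attribute 45, and comments with no attribute. If the colon arrived as %3A instead, the first section would come back as a single name, which is exactly the failure we want to avoid.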

Of course, we are doing this for URL-friendliness, so we'd like to make sure that none of the characters that carry special meaning get URL-encoded by browsers, or by any other system, leaving us with a URL like this:

http://site/gwturl#user%3A45/comments <--- BAD

Is using the colon in this way safe (by which I mean it won't be automatically encoded) for browsers, bookmarking systems, even JavaScript or Java code?

+2  A: 

I wouldn't count on it. It'll likely get url encoded as %3A by many user-agents.

Asaph
*Many* user agents?
arbales
@arbales: Yes. Some less compliant user-agents will leave non-compliant urls unadorned.
Asaph
+2  A: 

I can't find the right RFC offhand, but my gut says it's not valid and you shouldn't do it even if it were. You will be running the risk that even if a colon in a hash shouldn't be encoded according to standards, it will be in the real world.

Pekka
+1  A: 

From URLEncoder javadoc:

For more information about HTML form encoding, consult the HTML specification.

When encoding a String, the following rules apply:

  • The alphanumeric characters "a" through "z", "A" through "Z" and "0" through "9" remain the same.
  • The special characters ".", "-", "*", and "_" remain the same.
  • The space character " " is converted into a plus sign "+".
  • All other characters are unsafe and are first converted into one or more bytes using some encoding scheme. Then each byte is represented by the 3-character string "%xy", where xy is the two-digit hexadecimal representation of the byte. The recommended encoding scheme to use is UTF-8. However, for compatibility reasons, if an encoding is not specified, then the default encoding of the platform is used.

That is, : is not safe.
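A small sketch of what those rules do to the example from the question. The class and method names are made up for illustration. Bear in mind that URLEncoder implements HTML form encoding, so this shows what that class does to a colon, not necessarily what a browser does to a fragment.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical demo class for the form-encoding rules quoted above. */
final class FormEncodingDemo {
    /** Applies java.net.URLEncoder with UTF-8 to the given value. */
    static String formEncode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 must be supported by every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Here formEncode("user:45/comments") returns user%3A45%2Fcomments, so both the colon and the slash are escaped, while formEncode("a b.c-d*e_f") returns a+b.c-d*e_f.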

axtavt
+3  A: 

Colon isn't safe. See here

Bob
+3  A: 

I don't see Firefox or IE8 encoding some of the Wikipedia URLs that include the character.

kprobst
Opera also keeps the colon, but counting on such behavior is not a good thing to do.
Veger
Renesis is talking about the URL fragment and not the URL path.
Gumbo
Wikipedia was one of my thoughts when writing this question. Is its use of colons technically invalid/unsafe then? I commonly see ( and ) in Wikipedia URLs encoded, but never the colon, which left me a bit confused.
Renesis
The Wayback Machine has a : in many of its links - e.g. http://web.archive.org/web/20080822150704/http://stackoverflow.com/
barrowc
A: 

Better to avoid it. A comma is preferable, for example: example/key,value/key,value

Or use a slash throughout and work out from position which segments are keys and which are values.

enbuyukfener
But according to http://www.blooberry.com/indexdot/html/topics/urlencoding.htm, the "," falls in the same class as ":". Is there some practical decision by browser-makers to treat one or the other any different? Or even to ignore encoding on both in the path or the hash?
Renesis
Well picked up. I made an incorrect assumption based on ":" having a special purpose in URLs (delimiting usernames and passwords) and "," not having one as far as I know (that, and the fact that commas are used on several major sites, for example: example.com/1234,5678)
enbuyukfener
+1  A: 

It is not a safe character: when it appears right after the domain name, it separates the host from the port you connect to.

RHicke
+14  A: 

I recently wrote a URL encoder, so this is pretty fresh in my mind.

http://site/gwturl#user:45/comments

All the characters in the fragment part (user:45/comments) are perfectly legal for RFC 3986 URIs.

The relevant parts of the ABNF:

fragment      = *( pchar / "/" / "?" )
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

Apart from these restrictions, the fragment part has no defined structure beyond the one your application gives it. The scheme, http, only says that you don't send this part to the server.
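As a rough way to check a candidate fragment against the ABNF quoted above, the rules can be transcribed into a regular expression. This is only a sketch: the class and method names are invented for the example, and it tests syntax only, saying nothing about how a particular browser behaves.

```java
import java.util.regex.Pattern;

/** Hypothetical checker transcribing the RFC 3986 fragment rule quoted above. */
final class FragmentGrammar {
    // fragment = *( pchar / "/" / "?" ), with pchar expanded into its parts.
    private static final Pattern FRAGMENT = Pattern.compile(
        "(?:[A-Za-z0-9\\-._~]"   // unreserved
        + "|%[0-9A-Fa-f]{2}"     // pct-encoded
        + "|[!$&'()*+,;=]"       // sub-delims
        + "|[:@/?])*");          // ":" and "@" from pchar, plus "/" and "?"

    /** Returns true when the whole string (without the leading "#") matches the rule. */
    static boolean isValidFragment(String fragment) {
        return FRAGMENT.matcher(fragment).matches();
    }
}
```

Under this transcription, user:45/comments is accepted as written. So is user%3A45/comments, since percent-encoding is also legal. A fragment containing a space or a second # is rejected.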


EDIT:

D'oh!

Despite my assertions about the URI spec, irreputable provides the correct answer when he points out that the HTML 4 spec restricts element names/identifiers.

Note that identifier rules are changing in HTML 5. URI restrictions will still apply (at time of writing, there are some unresolved issues around HTML 5's use of URIs).

McDowell
I think you are on to something, can you explain this a little further? Not sending this to the server is not an issue, as we are using GWT. I'm just not sure I'm clear on the syntax specified by the section you quoted.
Renesis
But `:` is a gen-delim, not a sub-delim.
bobince
The colon is legal for a pchar, so whether it is a sub-delim or a gen-delim is not an issue.
Veger
@bobince - `:` is in `pchar`, which is in `fragment`, so `:` is allowed. @Renesis - Wikipedia has an article on ABNF http://en.wikipedia.org/wiki/ABNF You are basically looking at a list of allowed characters, where `/` means _OR_. I haven't done any GWT programming, so I don't know how it uses the fragment part of URIs.
McDowell
One last question -- do you have any insight into the real-world application of this specification? Does this mean browsers should/will ignore (skip the encoding of) the `:` in the fragment?
Renesis
It's important that people realise this is the correct answer; everyone else is saying it isn't valid, but *it is after the '#' symbol, so it is*.
Noon Silk
@Renesis - I had forgotten about the HTML 4 limitations - see this answer: http://stackoverflow.com/questions/2053132/is-a-colon-safe-for-friendly-url-use/2053640#2053640
McDowell
A: 

Colons are used as the split between username and password if a protocol requires authentication.

Joseph Silvashy
+5  A: 

In addition to McDowell's analysis of the URI standard, remember also that the fragment must be a valid HTML anchor name. According to http://www.w3.org/TR/html4/types.html#type-name

ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").

So you are in luck: ":" is explicitly allowed. Nobody should "%"-escape it, not only because "%" is an illegal character there, but also because the fragment must match the anchor name character by character, so no agent should tamper with it in any way.
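The quoted HTML 4 rule is easy to express as a pattern if you want to sanity-check individual sections. This is a sketch with invented names, applied to a single section such as user:45, because "/" is not among the characters the quoted rule lists.

```java
import java.util.regex.Pattern;

/** Hypothetical checker for the HTML 4 ID and NAME token rule quoted above. */
final class Html4NameToken {
    // A letter, followed by any number of letters, digits, "-", "_", ":" and ".".
    private static final Pattern NAME =
        Pattern.compile("[A-Za-z][A-Za-z0-9\\-_:.]*");

    /** Returns true when the whole value is a valid ID/NAME token under that rule. */
    static boolean isNameToken(String value) {
        return NAME.matcher(value).matches();
    }
}
```

With this check, user:45 passes, while user%3A45 fails because of the "%" and 45:user fails because it does not start with a letter.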

However, you have to test it. Web standards are not strictly followed, and sometimes they conflict. For example, HTTP/1.1 (RFC 2616) does not allow a query string in the request URL, while HTML constructs one when submitting a form with the GET method. Whatever is implemented in the real world wins at the end of the day.

irreputable
@irreputable - yes, you are quite right.
McDowell