ansaurus

Question

Encoding rules for URL with the `javascript:` pseudo-protocol?

Answer 1

A:

My findings, so far:

First, there are the rules for writing a valid HTML attribute value. Here the standard only requires (if the attribute value is enclosed in quotes) arbitrary CDATA. The attribute type is actually %URI, but HTML itself imposes no additional validation at that level: any CDATA will validate.

Some examples:

 <a href="javascript:alert('Hi!')">     (1)
 <a href="javascript:if(a > b && 1 < 0) alert(  b ? 'hi' : 'bye')">   (2)
 <a href="javascript:if(a&gt;b &amp;&amp; 1 &lt; 0) alert( b ? 'hi' : 'bye')">  (3)

Example (1) is valid, but so is example (2): it is valid HTML 4.01 Strict. To make it valid XHTML we only need to escape the XML special characters < > & (example (3) is valid XHTML 1.0 Strict).

Now, is example (2) a valid javascript: URI? I'm not sure, but I'd say it's not.

From RFC 2396: a URI is subject to some additional restrictions and, in particular, to escaping/unescaping via %xx sequences. Some characters are always prohibited, among them spaces and {}#.
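The step from example (2) to example (3) can be sketched in JavaScript as below. The helper name `escapeAttribute` is hypothetical (it is not a built-in), and it covers only the HTML/XML attribute level, not URI validity:

```javascript
// Hypothetical helper: escape the characters that are special inside a
// double-quoted XML/XHTML attribute value.
function escapeAttribute(value) {
  return value
    .replace(/&/g, '&amp;')   // must run first, or later entities get double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;'); // needed only because the attribute uses double quotes
}

const script = "javascript:if(a > b && 1 < 0) alert(b ? 'hi' : 'bye')";
const anchor = '<a href="' + escapeAttribute(script) + '">';
console.log(anchor);
```

The result is well-formed markup, but the attribute value still contains spaces, so the question of whether it is a valid URI remains open.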

The RFC also defines a subset of opaque URIs: those that do not have hierarchical components, and for which the separating characters have no special meaning (for example, they don't have a 'query string', so the ? can be used like any non-special character). I assume javascript: URIs should be considered among them.

This would imply that the valid characters inside the 'body' of a javascript: URI are

 a-zA-Z0-9 
 _.!~*'();?:@&=+$,/-
 %hh : (escape sequence, with two hexadecimal digits)

with the additional restriction that it can't begin with /. This still leaves out some "important" ASCII characters, for example

{}#[]<>^\

Also % (because it's used for escape sequences), double quotes " and (most important) all blanks.
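As a rough check of the rules above, here is a sketch of a validator for the 'body' of the URI, based on my reading of the `opaque_part` production in RFC 2396. The names `OPAQUE_PART` and `isValidOpaquePart` are made up for this example:

```javascript
// opaque_part = uric_no_slash *uric   (RFC 2396)
// Allowed: alphanumerics, the marks -_.!~*'(), the reserved characters
// ;?:@&=+$, (plus / after the first position), and %hh escapes.
const OPAQUE_PART =
  /^(?:[A-Za-z0-9\-_.!~*'();?:@&=+$,]|%[0-9A-Fa-f]{2})(?:[A-Za-z0-9\-_.!~*'();?:@&=+$,\/]|%[0-9A-Fa-f]{2})*$/;

// Hypothetical helper: true if `body` (the part after "javascript:")
// uses only characters that RFC 2396 allows in an opaque URI.
function isValidOpaquePart(body) {
  return OPAQUE_PART.test(body);
}

console.log(isValidOpaquePart("alert('Hi!')"));       // true
console.log(isValidOpaquePart("if(a > b) alert(1)")); // false: spaces and >
console.log(isValidOpaquePart("a%20b"));              // true: space is escaped
```

This only says whether a string conforms to the RFC grammar; it says nothing about what browsers actually accept.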

In some respects, this seems quite permissive: it's important to note that + is valid (and hence it should not be 'unescaped' when decoding, as a space).
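To illustrate the point about +, this small sketch compares plain percent-decoding with form-style decoding (the `application/x-www-form-urlencoded` convention, which is a different set of rules and is shown here only for contrast). The variable names are arbitrary:

```javascript
const body = "alert(1+2)";

// Plain URI unescaping leaves + alone.
const uriDecoded = decodeURIComponent(body);

// Form-style decoding (as used for query strings) turns + into a space,
// which would corrupt the script.
const formDecoded = new URLSearchParams("code=" + body).get("code");

console.log(uriDecoded);  // alert(1+2)
console.log(formDecoded); // alert(1 2)
```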

But in other respects it seems too restrictive, especially for braces and brackets: I understand that they are normally used unescaped and browsers have no problems with them.

And what about spaces? Like braces, they are disallowed by the RFC, but I see no problem with them in this kind of URI. However, I see that in most bookmarklets they are escaped as "%20". Is there any (empirical or theoretical) explanation for this?

I still don't know whether there are standard functions (in mainstream languages) or sample code to perform this escaping/unescaping.
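One candidate, offered as a sketch rather than a confirmed answer: JavaScript's built-in `encodeURIComponent` leaves unescaped exactly the alphanumerics and the marks `-_.!~*'()`, and percent-escapes everything else, including spaces, braces and brackets. That is stricter than the RFC requires (it also escapes `;?:@&=+$,/`), but the output stays within the allowed set. The wrapper names `toJavascriptUri` and `fromJavascriptUri` are hypothetical, and the sketch assumes the browser percent-decodes the body before running it:

```javascript
const PREFIX = "javascript:";

// Hypothetical helper: build a javascript: URI from script source.
function toJavascriptUri(source) {
  return PREFIX + encodeURIComponent(source);
}

// Hypothetical helper: recover the script source from such a URI.
function fromJavascriptUri(uri) {
  if (!uri.startsWith(PREFIX)) {
    throw new Error("not a javascript: URI");
  }
  return decodeURIComponent(uri.slice(PREFIX.length));
}

const source = "if (a > b) { alert('hi'); }";
const uri = toJavascriptUri(source);
console.log(uri);
console.log(fromJavascriptUri(uri) === source); // true
```

Note that this handles only the URI level; the result still has to be escaped as an HTML attribute value if it contains & or quotes (it will not contain < > or " after encoding).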

leonbloy 2010-07-15 19:52:12
