I'm a web developer with an emphasis on server-side programming. What little I've tinkered with JavaScript, I've done with externally referenced files or event handlers, and the barest minimum of an initialising function call between <script> tags.

As such it came as a surprise to me about a week ago that the data between <script> tags is not commonly escaped. In fact... it can't be. Escaping it will throw a massive lolwut-ohnoez-wrench into the works of the JavaScript parser in, as far as I know, every browser on the face of the earth.

This leads us to the (IMO) clusterfuck that is having to use CDATA for documents with in-HTML JavaScript blocks to pass validation (in XHTML), which still breaks hilariously the moment you have ]]> in your code for any arbitrary reason.

As something of an encoding/escaping purist, I get the twitches looking at this. And for several days I've now asked myself:

Why?

Whose idea was it to exempt <script> (and, for example, quite distinctly not the JS event handlers like onclick) from the otherwise holy rule of 'non-HTML stuff between HTML tags should be HTML escaped', and why? Is it a case of 'this just grew that way historically, it's botched now, deal with it', or did someone sit down and think up something I'm not seeing?

The same is true (though less obviously so) for CSS and the <style> tag.

Do we even know what prompted this - or is it a case of lost knowledge? My google-fu on this topic has been incredibly weak, and I've not found anything, but since this is actually bugging me in pathetically OCD ways, I'd love to hear explanations if anyone has any.

+5  A: 

Because it is very common to want to use characters such as & and < in scripts, and escaping them is a pain.

On the flip side, <script> and <style> can't have child elements, so there is no need to make it easy to include a tag.

The result - HTML defines <script> and <style> as containing CDATA in the DTD, so you don't need to do it manually in the document, thus making life easier.
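One consequence of that content model is that the only thing a script still has to avoid is a sequence that looks like an end tag, since that is what terminates the CDATA content. A minimal sketch of the usual workaround follows; `escapeEndTags` is a hypothetical helper name, and it assumes every `</` in the script sits inside a string literal, where `\/` is just another spelling of `/`.

```javascript
// Hypothetical helper, not part of any library: rewrites "</" as "<\/" so the
// HTML parser cannot mistake it for the end of the <script> element.
// Assumption: each "</" occurs inside a JavaScript string literal, where the
// backslash is harmless and the string value stays the same.
function escapeEndTags(jsSource) {
  return jsSource.replace(/<\//g, '<\\/');
}

const source = 'document.write("<em>works</em>");';
const safe = escapeEndTags(source);

console.log(safe);
```

The rewritten source produces the same string at runtime, so nothing else in the script needs to change.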

XHTML is different. In many ways XML is simpler than SGML, and its DTDs don't (as far as I know) have that facility. Hence, you need to be explicit about CDATA markers (or use entities) in XHTML. The only reason it is a "clusterfuck" is because people claim their XHTML is HTML by serving it with a text/html content-type (instead of the correct application/xhtml+xml).
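For the `]]>` problem mentioned in the question, a common trick is to end the CDATA section right after the `]]` and open a new one before the `>`. The sketch below shows that idea; `wrapInCdata` is a hypothetical name, and it assumes the document really is parsed as XML (application/xhtml+xml), not as text/html.

```javascript
// Hypothetical helper, not part of any library: wraps script source in a CDATA
// section for an XML-parsed XHTML document.
// Every "]]>" in the source is split across two sections, so the XML parser
// never sees the terminator inside the script text itself.
function wrapInCdata(jsSource) {
  const split = jsSource.split(']]>').join(']]]]><![CDATA[>');
  return '<![CDATA[' + split + ']]>';
}

const wrapped = wrapInCdata('if (a[b[0]]>1) go();');

console.log(wrapped);
```

An XML parser joins the adjacent sections back together, so the script engine receives the original source unchanged.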

As for intrinsic event attributes, SGML doesn't make it possible to say that special characters in an attribute value should not be treated as such, but when those attributes are used they shouldn't contain much more than a function call … and are better avoided in favour of unobtrusive JS anyway.

David Dorward
You're right about CDATA being an XHTML(/XML) thing. I'll edit that into my question, since I do know it, but I've apparently managed to omit it. D'oh. :) Will respond to the other points in a second (but definitely thanks for responding, and you're probably on to something).
pinkgothic
Right, back to you. :D I didn't realise HTML defines the contents of those tags as `CDATA`; I just thought that JS-blocks passing validation in HTML came from SGML being more lenient than XML! Learn something new every day. (Where is that +2 when you need it?) The other stuff I've sort of rambled to death in MooGoo's general direction, so I'm not going to repeat them here (if that's okay).
pinkgothic
I think the main reason is that `<script>` and `<style>` elements don't need children, so removing the need to escape everything manually (or put it in `CDATA`) seemed logical because you would never *actually* want a `<` inside a script to start a tag.
musicfreak
@musicfreak: Ooh, great comment for emphasis. Now I'm seeing David's second paragraph in a wholly different light.
pinkgothic
+2  A: 

Because in JavaScript you are constantly using characters that would need to be escaped in HTML. That is the point of having CDATA, after all, isn't it?

Tell me which you think looks more reasonable:

if (5 &gt; 4 &amp;&amp; 2 &lt; 3) alert('dude');

Or

if (5 > 4 && 2 < 3) alert('dude');
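To make the relationship between the two versions concrete, here is a minimal sketch of the escaping step that turns the readable line into the escaped one. `escapeHtmlText` is a hypothetical helper; it assumes only `&`, `<` and `>` need escaping in element content, which is why the quotes are left alone.

```javascript
// Hypothetical helper, not part of any library: escapes the three characters
// that are special in HTML element content. "&" must be replaced first so the
// entities produced by the later steps are not escaped a second time.
function escapeHtmlText(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

const readable = "if (5 > 4 && 2 < 3) alert('dude');";
const escaped = escapeHtmlText(readable);

console.log(escaped);
```

This is what every inline script would have to go through if `<script>` content were treated like ordinary element text.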

Also in the vast majority of cases, both CSS and Javascript should be included as links to separate files, rather than inlined in HTML, thus avoiding the escaping issue entirely.

MooGoo
"Also in the vast majority of cases, both CSS and Javascript should be included as links to separate files, rather than inlined in HTML, thus avoiding the escaping issue entirely." I completely agree. :) But I don't feel 'constantly using characters that should be escaped' is ever a valid reason not to escape something - yet, the more I think about it, the more that's probably what happened when it was defined. :/ Do you have an idea why CSS does it? It's far less frequent there.
pinkgothic
By the way, simply because it's probably not what you expect me to say, but nonetheless true, I consider the former more reasonable - *in an HTML context*. To me, escaped data will always look more reasonable, that's the aforementioned OCD coming through (even if it is slightly tongue-in-cheek). ;) Nonetheless, I do understand what you're getting at. And agree that it's probably the motivation.
pinkgothic
I would say that `<![CDATA[ ... ]]>` simply changes which characters need to be escaped. As you said, any instance of `]]>` would break validation, so if it were to be included in the JS code, it would have to be represented in a way that does not conflict with the "host" language, thus...escaping it. The probability of `]]>` appearing in most JavaScript code is pretty low, so it is a reasonable trade-off.
MooGoo
MooGoo
'so anything to make it easier and less error-prone is a good thing I think' - that's kind of why the 'oh we won't escape it here!' bugs me, though, because it starts adding exceptions to what is otherwise a pretty clear rule. (I don't like HEREDOC either, to be honest.) And I can live with how it is now (especially after your and David's helpful answers), but it's hard not to consider it hack... -ish.
pinkgothic
Pre-processing the HTML server-side is actually something I wanted to do, that got foiled by this behaviour. I thought it might be an option to HTML-escape ), (, } and { (and only those) across the entire output-document to prevent cross-origin CSS attacks, but that failed because of JS blocks. Well. Things in life are never that easy. ;) And that would have broken if we'd had non-implicit CDATA blocks somewhere, so it's probably good that *this* threw a wrench in the works. It just surprised me.
pinkgothic
Yea...clear rules often seem to give way to convenience; such is life, especially in the realm of computers. I hear what you are saying about reasonable looking HTML. To me, almost any CSS/JS in HTML is ugliness. Perfectly indented HTML on the other hand is a thing of beauty.
MooGoo