When outputting HTML, there are several different places where text can be interpreted as control characters rather than as text literals. For example, in "regular" text (that is, outside any element markup):

<div>This is regular text</div>

As well as within the values of attributes:

<input value="this is value text">

And, I believe, within HTML comments:

<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->

Each of these three kinds of text has different rules for how it must be escaped in order to be treated as non-markup. So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters? The above contexts clearly have different rules about what needs to be escaped.

The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup? For example, in theory you only need to escape ' and " in attribute values, since within an attribute value only the closing-delimiter character (' or " depending on which delimiter the attribute value started with) would have control meaning. Similarly, within "regular" text only < and & have control meaning. (I realize that not all HTML parsers are identical. I'm mostly interested in what is the minimum set of characters that need escaping in order to appease a spec-conforming parser.)

Tangentially: The following text will throw errors as HTML 4.01 Strict:

<a href="http://example.com/file.php?x=1&y=2">foo</a>

Specifically, it says that it doesn't know what the entity "&y" is supposed to be. If you put a space after the &, however, it validates just fine. But if you're generating this on the fly, you're probably not going to want to check whether each use of & will cause a validation error, and instead just escape all & inside attribute values.

+1  A: 

The above contexts clearly have different rules about what needs to be escaped.

I'm not sure that the different elements have different encoding rules like you say. All the examples you list require the HTML encoding.

E.g.

<h1>Fish &amp; Chips</h1>
<img alt="Awesome picture of Meat Pie &amp; Chips" />
<a href="products.aspx?type=1&amp;meal=fish%20%26%20chips&amp;page=1">Fish &amp; Chips</a>

The last example also URL-encodes the ampersand inside the data (%26), and it's at this point that things get hairy: that ampersand is being sent as data, which is why it must be URL-encoded, while the ampersands that separate the parameters are HTML-encoded as &amp;.
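To make the two layers concrete, here is a minimal Python sketch that rebuilds the href from the last example. The helper name build_href is made up for illustration; urlencode, quote and html.escape are standard-library functions.

```python
import html
from urllib.parse import quote, urlencode

def build_href(path, params):
    """Hypothetical helper: URL-encode the data, then HTML-encode the URL."""
    # Layer 1: URL-encode each value, so a literal & in the data becomes %26
    # (quote_via=quote gives %20 for spaces instead of +).
    query = urlencode(params, quote_via=quote)
    # Layer 2: HTML-encode the finished URL, so the & separators become &amp;.
    return html.escape(path + "?" + query, quote=True)

href = build_href("products.aspx", {"type": 1, "meal": "fish & chips", "page": 1})
print(href)  # products.aspx?type=1&amp;meal=fish%20%26%20chips&amp;page=1
```

The order matters: URL encoding is applied to the data first, and HTML encoding is applied last, to the whole attribute value.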

So my first question is, are there any other contexts in HTML in which characters can be interpreted as markup/control characters?

Anywhere within the HTML document, if the control characters are not being used as control characters, you should encode them (as a good rule of thumb). Most of the time it's HTML encoding (&amp;, &gt;, etc.). Other times, when trying to pass these characters via a URL, use URL encoding (%20, %26, etc.).

The second question is, what are the canonical, globally-safe lists of characters (for each context) that need to be escaped to ensure that any embedded text is treated as non-markup?

I'd say that the Wikipedia article has a few good comments on it and might be worth a read; the W3 Schools article is also a good starting point, I guess. Most languages have built-in functions to prepare text as safe HTML, so it may be worth checking your language of choice (if you are indeed using a scripting language and not hand-coding the HTML).

Specifically, Wikipedia says: "Characters <, >, " and & are used to delimit tags, attribute values, and character references. Character entity references &lt;, &gt;, &quot; and &amp;, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of the characters."

For URL Encoding, this article seems a good starting point.

Closing thoughts, as I've already rambled a bit: this all excludes XML / XHTML, which brings a whole other ballgame to the court with its requirement that pretty much the world and its dog be encoded. If you are using a scripting language and writing out a variable via that, I'm pretty sure it'll be easier to find the built-in function, or download a library that'll do this for you. :) I hope this answer was scoped OK and didn't miss the point of the question or come across in the wrong tone. :)

Amadiere
They do have different escaping rules; you don't need to escape a < inside an attribute value because it has no control functionality in that context, but you do need to escape it in regular text because there, < has control functionality. Conversely, in regular text, " and ' have no control functionality, but they do inside an attribute value. Now, it doesn't HURT to escape ' and " in regular text, but it's unnecessary. So perhaps it would be best to just establish a list of every control character in any context in HTML, and always escape all of them.
dirtside
Agreed. I'd just encode all the time, as it reduces the chances of a mistake slipping through, I guess. :)
Amadiere
A: 

If you are this concerned about the validity of the final HTML, you might consider constructing the HTML via a DOM, versus as text.

You don't say what environment you are targeting.
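Since no environment was given, here is a rough sketch of the DOM approach in Python, assuming the standard-library xml.etree.ElementTree module is acceptable as the tree builder. The serialiser does the escaping, so the raw values are set unescaped.

```python
import xml.etree.ElementTree as ET

# Build the element as a tree node instead of concatenating strings.
link = ET.Element("a")
link.set("href", "http://example.com/file.php?x=1&y=2")  # raw, unescaped value
link.text = "Fish & Chips"                                # raw, unescaped text

# Serialising escapes & in both the attribute value and the text content.
markup = ET.tostring(link, encoding="unicode", method="html")
print(markup)
```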

Chase Seibert
I'm not concerned with practice here, only with theory. The behavior of different web browsers *aside*, what theoretically is the best practice for escaping?
dirtside
Or rather, the best *theory* ;-)
dirtside
+4  A: 
<div>This is regular text</div>

Text content: & must be escaped. < must be escaped.

If producing a document in a non-UTF encoding, characters that do not fit inside the chosen encoding must be escaped.
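As an illustration of the encoding point, a small Python sketch (the sample string is made up): the standard xmlcharrefreplace error handler turns any character that does not fit the target encoding into a numeric character reference.

```python
# "caf\u00e9 \u2603" is "cafe" with an acute accent, a space, and a snowman;
# neither non-ASCII character fits in US-ASCII.
text = "caf\u00e9 \u2603"

# Characters outside the chosen encoding become &#NNN; references.
# Note this step does not escape & or <; that has to be done separately, first.
encoded = text.encode("ascii", errors="xmlcharrefreplace")
print(encoded)  # b'caf&#233; &#9731;'
```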

In XHTML (and XML in general), the sequence ]]> must not occur in text content, so in that specific case one of the characters in that sequence must be escaped, traditionally the >. For consistency, the Canonical XML specification chooses to escape > every time in text content, which is not a bad strategy for an escaping function, though you can certainly skip it for hand-authoring.
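A minimal sketch of a text-content escaper along those lines, in Python. The function name and the escape_gt switch are made up for illustration; by default it follows the Canonical XML habit of always escaping >.

```python
def escape_text(s, escape_gt=True):
    """Escape a string for use as HTML/XML text content (illustrative sketch)."""
    # & must go first, otherwise the & in the references added below
    # would itself be escaped again.
    s = s.replace("&", "&amp;")
    s = s.replace("<", "&lt;")
    if escape_gt:
        # Not required in HTML text, but it guarantees ]]> can never appear.
        s = s.replace(">", "&gt;")
    return s

print(escape_text("a < b && c ]]> d"))  # a &lt; b &amp;&amp; c ]]&gt; d
```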

<input value="this is value text">

Attribute values: & must be escaped. The attribute value delimiter " or ' must be escaped. If no attribute value delimiter is used (don't do that) no escape is possible.

Canonical XML always chooses " as the delimiter and therefore escapes it. The > character does not need to be escaped in attribute values and Canonical XML does not. The HTML4 spec suggested encoding > anyway for backwards compatibility, but this affects only a few truly ancient and dreadful browsers that no-one remembers now; you can ignore that.

In XHTML < must be escaped. Whilst you can get away with not escaping it in HTML4, it's not a good idea.

To include tabs, CR or LF in attribute values (without them being turned into plain spaces by the attribute value normalisation algorithm) you must encode them as character references.
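Pulling the attribute-value rules together, a hedged Python sketch. The names escape_attr and attr are made up; the sketch always uses " as the delimiter, as Canonical XML does, and also escapes < so the output is safe for XHTML.

```python
def escape_attr(value):
    """Escape a string for a double-quoted attribute value (illustrative)."""
    replacements = [
        ("&", "&amp;"),    # must be first so later references are not re-escaped
        ("<", "&lt;"),     # required in XHTML, harmless in HTML
        ('"', "&quot;"),   # the delimiter this sketch always uses
        ("\t", "&#9;"),    # tab, CR and LF as character references so that
        ("\n", "&#10;"),   # attribute value normalisation does not turn
        ("\r", "&#13;"),   # them into plain spaces
    ]
    for raw, ref in replacements:
        value = value.replace(raw, ref)
    return value

def attr(name, value):
    # Always emit the delimiter; an unquoted value cannot be escaped safely.
    return f'{name}="{escape_attr(value)}"'

print(attr("value", 'say "hi" & <go>\n'))
# value="say &quot;hi&quot; &amp; &lt;go>&#10;"
```

Note that > is left alone, matching the point above that it has no special meaning inside an attribute value.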

For both text content and attribute values: in XHTML under XML 1.1, you must escape the Restricted Characters, which are the Delete character and C0 and C1 control codes, minus tab, CR, LF and NEL. In total, [\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]. The null character may not be included at all even escaped in XML 1.1. Outside XML 1.1 you can't use any of these characters at all, nor is there a good reason you'd ever want to.
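The character class above can be used more or less directly. A sketch in Python, with a made-up function name, that replaces each restricted character with a hexadecimal character reference and rejects the null character outright; it only makes sense when the output really is XML 1.1.

```python
import re

# The Restricted Characters of XML 1.1, as listed above.
RESTRICTED = re.compile(r"[\x01-\x08\x0B\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]")

def escape_restricted(s):
    """Replace XML 1.1 restricted characters with character references."""
    if "\x00" in s:
        # The null character cannot be represented at all, even escaped.
        raise ValueError("U+0000 cannot appear in an XML 1.1 document")
    return RESTRICTED.sub(lambda m: "&#x%X;" % ord(m.group()), s)

# Tab (\t) and NEL (\x85) are not restricted, so they pass through untouched.
print(repr(escape_restricted("a\x01b\tc\x85d\x7f")))
```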

<!-- This text here might be programmatically generated
and could, in theory, contain the double-hyphen character
sequence, which is verboten inside comments -->

Yes, but since there is no escaping possible inside comments, there is nothing you can do about it. If you write <!-- &lt; -->, it literally means a comment containing “ampersand-letter l-letter t-semicolon” and will be reflected as such in the DOM or other infoset. A comment containing -- simply cannot be serialised at all.
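So about the only thing a generator can do is refuse. A small Python sketch (the function name is made up) that follows the XML rule: the text may not contain -- and may not end in -, since that would run into the closing delimiter.

```python
def serialize_comment(text):
    """Wrap text in a comment, refusing text that cannot be serialised."""
    # There is no escape mechanism inside comments, so the only options are
    # to reject the text or to alter it; this sketch rejects it.
    if "--" in text or text.endswith("-"):
        raise ValueError("comment text cannot contain '--' or end with '-'")
    return "<!--" + text + "-->"

print(serialize_comment(" &lt; "))  # <!-- &lt; -->  (the & is literal, not an escape)
```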

<![CDATA[ sections and <?pi?> processing instructions in XML also cannot use escaping. The traditional solution to serialise a CDATA section including a ]]> sequence is to split that sequence over two CDATA sections so it doesn't occur together. You can't serialise it in a single CDATA section, and you can't serialise a PI with ?> in the data.
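The CDATA split can be sketched in a couple of lines of Python. The function name is made up; the round trip through the standard-library XML parser is only there to show that the two sections read back as the original text.

```python
import xml.etree.ElementTree as ET

def serialize_cdata(text):
    """Serialise text as CDATA, splitting any ]]> across two sections."""
    # "]]>" becomes "]]" (end of one section) + "]]><![CDATA[" + ">" (next one).
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

cdata = serialize_cdata("a]]>b")
print(cdata)  # <![CDATA[a]]]]><![CDATA[>b]]>

# Round trip: the parser joins the adjacent sections back together.
round_trip = ET.fromstring("<r>" + cdata + "</r>").text
print(round_trip)  # a]]>b
```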

CDATA-elements like <script> and <style> in HTML (not XHTML) may not contain the </ (ETAGO) sequence, as this would end the element early and then error if not followed by the end-tag-name. Since no escaping is possible within CDATA-elements, this sequence must be avoided and worked around, e.g. by turning document.write('</p>') into document.write('<\/p>'). (You see a lot of more complicated silly strategies to get around this one, like calling unescape on a JS-%-encoded string; even often '</scr'+'ipt>', which is still quite invalid.)
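If the script content is generated, the <\/ workaround can be applied mechanically. A Python sketch, assuming the value being embedded is a plain string and that a JSON string literal is an acceptable JavaScript string literal here; the function name is made up.

```python
import json

def js_string_for_script(s):
    """Return a JS string literal that is safe to place inside <script>."""
    # json.dumps produces a quoted, backslash-escaped literal; the replace
    # then breaks up every </ so the element cannot be closed early.
    # Inside a JS string, \/ is just another way of writing /.
    return json.dumps(s).replace("</", "<\\/")

print("document.write(" + js_string_for_script("</p>") + ");")
# document.write("<\/p>");
```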

There is one more context in HTML and XML where different rules apply, and that's in the DTD (including the internal subset in the DOCTYPE declaration, if you have one), where the % character has Special Powers and would need to be escaped to be used literally. But as an HTML document author it is highly unlikely you would ever need to go anywhere near that whole mess.

The following text will throw errors as HTML 4.01 Strict:

<a href="http://example.com/file.php?x=1&y=2">foo</a>

Yes, and it's just as much an error in Transitional.

If you put a space after the &, however, it validates just fine.

Yes, under SGML rules anything but [A-Za-z] and # doesn't start parsing as a reference. Not a good idea to rely on this though. (Of course, it's not well-formed in XHTML.)
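For what it's worth, that rule is easy to express as a pattern. A rough Python sketch (names are made up) that lists the positions where an & would begin to be parsed as a reference under the rule as stated here; it is an illustration of the rule, not a validator.

```python
import re

# An & only starts a reference when followed by a letter or #.
REFERENCE_START = re.compile(r"&(?=[A-Za-z#])")

def reference_starts(s):
    """Return the offsets of every & that would start a reference."""
    return [m.start() for m in REFERENCE_START.finditer(s)]

print(reference_starts("file.php?x=1&y=2"))   # [12]
print(reference_starts("file.php?x=1& y=2"))  # []
```

Escaping every & as &amp; regardless remains the simpler and safer policy.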

bobince
Thanks. Actually wasn't concerned about XML here, just HTML, but you verified what I thought to be the case already.
dirtside