ansaurus

Question

Proper entity types for XHTML, XML and inside inline JavaScript

Answer 1

A:

First, the way I understand it, it's more appropriate to use numeric entities in an XHTML document, such as `&#34;` instead of `&quot;`, is that right?

Not exactly.

There are two issues to worry about.

Is this going to be plain old XHTML or is it going to be HTML compatible XHTML?

There is no `&apos;` in HTML, so you can't use it in HTML-compatible XHTML (but you only need it in attribute values delimited with `'`, so just use `"` as the delimiter instead).

Is this going to be processed with an XML parser that is not DTD aware?

If so, only the generic XML entities will be recognized (quot, apos, gt, lt, amp).

On the other hand, named entities are much more readable. Real characters (e.g. via UTF-8) are most readable.
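To see the difference in practice, here is a minimal sketch using Python's standard-library `xml.etree.ElementTree`, which does not fetch external DTDs. The helper name `parses` and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the stdlib parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The five predefined XML entities need no DTD.
predefined_ok = parses("<p>&quot; &apos; &gt; &lt; &amp;</p>")

# Numeric character references (decimal or hex) need no DTD either.
numeric_ok = parses("<p>&#160;&#xE9;</p>")

# An HTML-only named entity is undefined when no DTD has been read.
html_named_ok = parses("<p>&nbsp;</p>")

print(predefined_ok, numeric_ok, html_named_ok)
```

With this parser the first two documents are accepted and the third is rejected as an undefined entity, which is the behaviour described above for a parser that is not DTD aware.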

Second, for my RSS XML feed, which entity type is correct?

Use quot, gt, lt, amp where needed and real characters elsewhere.

Third, which of the following is correct for entities inside inline JavaScript?

Better to use unobtrusive JS instead of intrinsic event attributes.

That said, the rules are the same as for any other HTML attribute — only & and whatever character you used to delimit the attribute value need to be represented with an entity.

David Dorward 2009-11-14 16:22:03

Thanks David. So for Q1: I assume the answer is "use named entities." Q2: Is there any harm in using the numeric entities on my feed? Q3: Did not understand the answer.

Jeff 2009-11-14 16:35:31

Answer 2

A:

First, the way I understand it, it's more appropriate to use numeric entities in an XHTML document, such as `&#34;` instead of `&quot;`, is that right?

`&quot;` is also defined for XHTML. So you can use both.

Second, for my RSS XML feed, which entity type is correct? Named or numeric? I believe it's numeric, but see examples of both in my searches.

Again, `&quot;` is also defined for XML. So you can use both.

Third, which of the following is correct for entities inside inline JavaScript?

The second one is correct since a plain < is not allowed inside an attribute value declaration (but > is).

Edit Now that you refined your question:

I would use a charset that contains all characters I need. So if you want to be able to use almost any character, use Unicode and encode the characters with UTF-8.

That way you can write any character directly in UTF-8 and only need character references for the special characters of XML (at least &, <, >, " and ').

And here you have the free choice between the named or numeric character references. Use what you like better or what your programming language uses/prefers.

Gumbo 2009-11-14 16:42:11

Thanks Gumbo. If an entity is not defined for XML, such as `&nbsp;`, should I be using the numeric entity for my XML RSS feed?

Jeff 2009-11-14 18:45:17

@Jeff: Yes, the numerical character references do always work.

Gumbo 2009-11-14 19:04:41

Thanks for the edit. My server and code are both using UTF8. Just to be clear, are you telling me that it's okay for me to use any of my options above? Specifically 3?

Jeff 2009-11-14 21:16:20

@Jeff: Yes, you can use all three options as long as the named character references are defined. For XML, those are `&gt;`, `&lt;`, `&amp;`, `&apos;` and `&quot;` (see http://www.w3.org/TR/xml/#sec-predefined-ent).

Gumbo 2009-11-14 21:27:33

I read a little bit about "defining" entities for XML, but wasn't sure what it meant. If it requires me to define `&quot; = &#34;` in my feeds' XML head, I think I'd rather just use the numeric entities. Do I have that right?

Jeff 2009-11-14 21:45:25

@Jeff: No, the five named ones are already defined for XML. But if you would want to use other entities you would have to define them.

Gumbo 2009-11-14 21:49:13

Got it! Many thanks!

Jeff 2009-11-14 22:50:14

Answer 3

A:

<, & and " in attribute values where " is the delimiter: use `&lt;`, `&amp;` and `&quot;`, respectively.

These are predefined entities in XML so will work with any parser regardless of whether it reads the document type. They are also normal defined entities in HTML.

Numeric character references are just as valid, but slightly harder to read.

> in text content: use `&gt;` or leave as-is.

> doesn't normally need escaping, it's perfectly legal in an attribute value at all times, and it's legal in text content as long as it doesn't form part of a ]]> sequence. (This is an obscure, pointless and sometimes-ignored part of the XML spec.) You might prefer to always escape it in text content anyway, just to be safe and not have to remember this rule. (That's what Canonical XML does.)

Numeric character references are just as valid, but slightly harder to read.
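As a rough illustration of the "always escape it anyway" approach for text content, the sketch below uses `xml.sax.saxutils.escape` from the Python standard library, which escapes `&`, `<` and `>`. The sample string is invented; it happens to contain a `]]>` sequence.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical text content that contains "]]>" as well as "&" and "<".
raw_text = "a[b[0]]>1 && c<2"

# escape() replaces &, < and >, so the "]]>" sequence never reaches the output.
escaped = escape(raw_text)
print(escaped)  # a[b[0]]&gt;1 &amp;&amp; c&lt;2

# Round trip: a parser gives back exactly the original text.
round_tripped = ET.fromstring(f"<code>{escaped}</code>").text
print(round_tripped == raw_text)
```

Escaping `>` unconditionally costs a few bytes but means the `]]>` rule never has to be remembered.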

' in attribute values where ' is the delimiter: use `&#39;`.

The numeric character reference is most correct here, because the XML predefined entity `&apos;` isn't technically defined by the HTML4 standard (even though it will work in all current browsers). The lateness of adding this entity reflects the common practice of always using " as the attribute value delimiter.

non-ASCII characters: include as-is

As long as you're using and declaring UTF-8 you can just spit the characters straight out. Smaller, more readable results.

non-ASCII characters (without Unicode): use numeric character reference

If for some reason you can't use UTF-8 (boooo!!!), use a character reference like `&#233;` in preference to the HTML entities. The HTML entities only cover a very small portion of the Unicode character set anyway; might as well use character references for all of them, IMO. I personally prefer to use the `&#x...` hex-escapes for the non-ASCII characters as it is traditional to refer to Unicode characters by their ‘U+xxxx’ hex code.
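For what it's worth, here is one way this could look in Python. The built-in `xmlcharrefreplace` error handler produces decimal references; the `to_hex_refs` helper is a hypothetical function written for this example to get the hex form instead. Note that neither escapes `&` or `<`, so they only handle the non-ASCII part.

```python
def to_hex_refs(text: str) -> str:
    """Replace every non-ASCII character with a hex reference (hypothetical helper)."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


word = "caf\u00e9"  # "café"; U+00E9 is the e-acute

# Built-in route: decimal character references.
decimal_form = word.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(decimal_form)  # caf&#233;

# Hex route: matches the usual U+xxxx way of naming code points.
hex_form = to_hex_refs(word)
print(hex_form)  # caf&#xE9;
```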

Though using the HTML entities is quite valid in an XHTML document, it means the parser has to fetch external entities such as the DTD to work out what the entities are. If you stick to the predefined entities and character references you can use a lightweight non-external-entity-including XML parser without losing your ability to find text-including-entity-references in the document.

The situation with RSS is murky, as usual with all the different RSS versions lurking about. RSS 0.91 had a DTD that included the older HTML 3.2 standard's entities, but the previous official SYSTEM URL for the DTD has gone walkies. (In an annoying and needless piece of internet vandalism, Netscape's owners, AOL, broke the link in a reorg a few years ago. Not only that but they also 302 you to their home page if you try to access it or any other address on the old site, thus serving a badly-written HTML page to clients expecting a DTD. Bad AOL, 302-404s are so bogus.)

RSS 2.0 doesn't have an official DTD at all. So either way, avoid the HTML entities, using the predefined entities and the numeric character references in preference.
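If existing content already contains HTML named entities, a small filter can turn them into numeric references before they go into a feed. This is only a sketch: the function name `html_entities_to_numeric` and the sample string are assumptions, and it relies on the HTML 4 entity table in Python's `html.entities.name2codepoint`.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML are left untouched.
XML_PREDEFINED = {"quot", "apos", "gt", "lt", "amp"}
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_numeric(text: str) -> str:
    """Rewrite HTML named entities as decimal character references."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Keep predefined XML entities and anything we do not recognise.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return _ENTITY.sub(replace, text)


feed_text = html_entities_to_numeric("caf&eacute;&nbsp;&amp; tea &copy; 2009")
print(feed_text)  # caf&#233;&#160;&amp; tea &#169; 2009
```

The output uses only predefined entities and numeric references, so it does not depend on any DTD being available to the feed reader.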

onmouseover="tooltip_on( '<strong>Tool...

Not allowable in any document type. < is invalid in an attribute value.

onmouseover="tooltip_on( '&lt;strong&gt;Tooltip...

Valid but unreadable. I second David's Unobtrusive JavaScript suggestion.
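If the inline handler has to stay, it is probably safer to let code do the escaping than to write the references by hand. A minimal sketch with Python's `html.escape` follows; the complete JavaScript string is a made-up stand-in for the truncated example above, and the surrounding `<a>` element is an assumption.

```python
from html import escape
import xml.etree.ElementTree as ET

# Hypothetical full version of the truncated handler from the question.
js = "tooltip_on('<strong>Tooltip</strong>')"

# escape() handles &, < and >; quote=True also covers " and '.
attr_value = escape(js, quote=True)
print(attr_value)  # tooltip_on(&#x27;&lt;strong&gt;Tooltip&lt;/strong&gt;&#x27;)

tag = f'<a href="#" onmouseover="{attr_value}">hover</a>'

# An XML parser accepts the result and hands back the original script text.
recovered = ET.fromstring(tag).get("onmouseover")
print(recovered == js)
```

`html.escape` emits the numeric reference `&#x27;` for the apostrophe rather than `&apos;`, which fits the advice above for HTML-compatible output.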

bobince 2009-11-14 19:02:24

I don't know if you're going to be famous for this reply =), but I'm definitely going to look into an alternative method for my JS. Is JQuery "unobtrusive"?

Jeff 2009-11-14 19:44:24

Yeah, I only get votes for glib post-pub nonsense posts, people hate long involved answers. ;-) jQuery (or any other framework) isn't inherently ‘unobtrusive’, but it's certainly common to use it that way, using selectors to choose elements and bind event handlers to them instead of using inline HTML event handler attributes.

bobince 2009-11-14 20:20:57
