ansaurus

Question

XML attribute-value normalisation - how should whitespace in entities be treated?

Answer 1

+1 A:

So the question is: is the replacement text of the entity a carriage-return character, or is it the character reference that represents a carriage-return character?

If you look at the examples in Appendix D of the XML Recommendation (especially the one described as "a more complex example"), it appears that the replacement text (in your example) should be a carriage-return character, and not the character reference. That means your "Java test" is the correct one. At least, that's if my interpretation of the appendix is correct.

However, note that Appendix D is non-normative, which means you would have to read the body of the Recommendation to find out the actual rules. I believe that's section 4.4, but that table just made my head hurt.

Paul Clapham 2010-01-29 22:14:10

Yes, the example in Appendix D does suggest a processing model where entity values are retrieved preassembled from the DTD, rather than IKEA flatpack fashion. That would certainly explain why the attribute-value normalisation rules don't distinguish whether the original entity value was the character itself or a numeric reference. I'll have to go through the spec more thoroughly to see if I can find a more explicit specification of this processing model. Thanks.

d__ 2010-01-29 22:43:34

Answer 2

A:

Section 4.5 of the XML spec, "Construction of Entity Replacement Text", draws two important distinctions.

For every entity there's a distinction between its literal entity value and the replacement text that's extracted from its literal value.
There are different rules for this mapping depending on whether it's an internal or an external entity.

An external entity, for our current purposes, can be thought of as being like an include file in C or PHP - it's a file or another external resource whose content is inserted and then processed. An internal entity is carried in the payload of the DTD, and to ensure that arbitrary internal entities can be carried without being mixed up with the DTD syntax, it's carried in an escaped form known as the literal entity value. In order to convert the literal entity value to its replacement text, the following rule is applied:

For an internal entity, the replacement text is the content of the entity, after replacement of character references and parameter-entity references.
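To make that rule concrete, here is a minimal Python sketch of the mapping from literal entity value to replacement text. It is a toy model rather than a parser: the function name replacement_text is made up for illustration, it only handles numeric character references, and it ignores parameter-entity references entirely.

```python
import re

# Matches a numeric character reference: &#x9; (hex) or &#38; (decimal).
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def replacement_text(literal_value: str) -> str:
    """Toy model: replace character references in a literal entity value.

    Assumption: parameter-entity references are not handled here.
    re.sub does not rescan the text it has just substituted, so exactly
    one level of escaping is removed per call.
    """
    def decode(match: re.Match) -> str:
        hex_digits, dec_digits = match.group(1), match.group(2)
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        return chr(int(dec_digits))

    return _CHAR_REF.sub(decode, literal_value)


for literal in ["\t", "&#x9;", "&#38;#x9;", "&#38;#38;#x9;"]:
    print(repr(literal), "->", repr(replacement_text(literal)))
```

The point to notice is that the decoded text is not scanned again, so each call strips exactly one level of escaping and no more.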

So:

A literal entity value of "[TAB]" maps to the replacement text [TAB]. (I'm using an ad-hoc escape here, where [TAB] means the tab character, since I can't type a tab into this textbox and have it understood. I hope that demonstrates rather than confuses: there are good reasons to have escape mechanisms, so the important thing is to understand where each one is being used and how something that looks complicated can be decomposed into separate levels of escaping.)
A literal entity value of "&#x9;" also maps to the replacement text [TAB]. As far as the attribute-value normalisation logic is concerned, it is a tab; the logic doesn't know that the tab was written in the internal entity as a character reference. It might seem that this is redundant or that some information is lost, but not really: escape mechanisms let you escape anything, including things that don't need escaping. For example, you could replace every Latin lower-case a in an HTML file with &#97; and neither gain nor lose information.
A literal entity value of "&#38;#x9;" maps to the replacement text &#x9;. The attribute-value normalisation logic will interpret that as a character reference for a tab, and will put a tab into the normalised value rather than replacing it with a space.
A literal entity value of "&#38;#38;#x9;" maps to the replacement text &#38;#x9;
And so on...
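The cases above can be checked against a real parser. The sketch below uses Python's xml.etree.ElementTree (backed by expat), which is a choice made here for illustration; the question itself used Java and PHP. The helper attr_value and the element, attribute and entity names are hypothetical. It declares one internal entity, references it from an attribute, and returns the normalised attribute value.

```python
import xml.etree.ElementTree as ET


def attr_value(entity_literal: str) -> str:
    """Return attribute 'a' of <r a="x&e;y"/> with entity e declared as given.

    entity_literal is the literal entity value exactly as it would
    appear between the quotes in the DTD's internal subset.
    """
    doc = (
        '<!DOCTYPE r [<!ENTITY e "' + entity_literal + '">]>'
        '<r a="x&e;y"/>'
    )
    return ET.fromstring(doc).get("a")


# Literal tab in the entity: replacement text is a tab.
print(repr(attr_value("\t")))
# Character reference in the entity: replacement text is still a tab.
print(repr(attr_value("&#x9;")))
# Escaped character reference: replacement text is the five characters &#x9;
print(repr(attr_value("&#38;#x9;")))
```

If the parser follows the rules as described, the first two calls should give x, space, y (the tab in the replacement text is normalised to a space) and the third should give x, tab, y (the character reference survives into the attribute value and is resolved there).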

It can look like some sort of off-by-one or double-encoding error that, in order for a [TAB] to show up in an attribute value, your internal entity has to contain the literal text &#38;#x9;. That impression is created by the fact that DTDs happen to use the same character escape mechanism as XML content does, but for a different reason. If DTDs used a different escape mechanism, for example along the lines of \u0009 for a tab, then the literal entity value would contain \uyyyy-escaped characters interspersed with &#xyyyy;-escaped characters, and we could always tell which escape mechanism belonged to which level. Anyway, that's not the way it's done, so we just have to keep a good idea of what's going on. It's like writing a regex to detect backslashes: you have to escape the backslash in the regex by doubling it, and if you're using a language without regex literals, you have to put the regex in a string with its own escapes, so it ends up as four backslashes in a row. That looks completely wrong, but it's right once you think about how the different levels of escape mechanism interact. (By the way, I originally tried to write out those backslashes, but in order to get around Stack Overflow's own escape mechanism I would have had to write eight backslashes in a row, and it didn't feel safe to write that.)
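The backslash analogy can be sketched in Python, which has string escapes but no regex literals. The variable names are made up for illustration. Four backslashes in the source become two characters in the string, and the regex engine reads those two characters as one literal backslash.

```python
import re

# Level 1 (string literal): four backslashes in source -> two characters.
pattern_text = "\\\\"
print(len(pattern_text))

# Level 2 (regex syntax): two backslash characters -> match one backslash.
single_backslash = re.compile(pattern_text)

sample = "a\\b"  # three characters: a, one backslash, b
print(len(sample))
print(single_backslash.findall(sample))

# A raw string removes the string-literal level, leaving only the regex one.
print(r"\\" == pattern_text)
```

A raw string plays the role that a separate \uyyyy-style escape would play for DTDs: it makes it obvious which level each escape belongs to.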

The above seems OK to me at the moment as an explanation of the spec and of the Java implementation as demonstrated in the sample code. It's obviously not consistent with the PHP sample, though I don't mean to imply that there's a bug: the PHP DOM implementation sits on top of a mature C library with a lot of configuration options, one or more of which might be tweakable to get behaviour that's consistent with the Java sample. Examples like this bring home to me how complicated XML is. Simplified explanations like the one I give above may be useful for a coarse-grained idea of what goes on 95% of the time, but the other 5% can be very hard to understand and explain. So if there's a flaw in my explanation, or you have a better one, please add a comment or another answer, the more pedantic the better.

d__ 2010-02-04 23:02:56
