I'm trying to ascertain what should happen when an XML parser reads in attribute a of element x in the sample below:
<!DOCTYPE x [
<!ELEMENT x EMPTY>
<!ATTLIST x a CDATA #IMPLIED>
<!ENTITY d "&#xD;">
<!ENTITY a "&#xA;">
<!ENTITY t "&#x9;">
<!ENTITY t2 "	"><!-- a real tab-->
]>
<x a="CARRIAGE_RETURNS:(&d;&#xD;),NEWLINES:(&a;&#xA;),TABS:(&t;&#x9;&t2;	)"/><!-- a real tab at the end -->
The essential part of the Attribute-Value Normalization rules in the spec involves traversing the attribute value and applying this case statement:
- For a character reference, append the referenced character to the normalized value.
- For an entity reference, recursively apply step 3 [that's the case statement] of this algorithm to the replacement text of the entity. [EDIT: replacement text, as distinct from literal entity value, seems to be the key concept in understanding what's going on. See below.]
- For a white space character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
- For another character, append the character to the normalized value.
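To make the case statement concrete, here is a minimal Java sketch of how I understand it, assuming the entity declarations are supplied as a map from entity name to literal entity value. The class and method names (AttrValueNormalizer, replacementText, step3, show) are made up for illustration. It ignores predefined entities, undeclared entities, line-break normalization and all well-formedness checks, so it is not a substitute for a real parser. The point it tries to show is that character references inside an entity declaration are expanded when the replacement text is built, so by the time step 3 recurses into the entity it sees a plain white space character.

```java
import java.util.HashMap;
import java.util.Map;

public class AttrValueNormalizer {

    // Parses the body of a character reference, e.g. "xD" or "13".
    static int parseCharRef(String body) {
        return body.startsWith("x")
                ? Integer.parseInt(body.substring(1), 16)
                : Integer.parseInt(body, 10);
    }

    // Literal entity value -> replacement text: character references are
    // expanded here; general entity references are left untouched.
    static String replacementText(String literal) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < literal.length()) {
            if (literal.startsWith("&#", i)) {
                int end = literal.indexOf(';', i);
                out.appendCodePoint(parseCharRef(literal.substring(i + 2, end)));
                i = end + 1;
            } else {
                out.append(literal.charAt(i++));
            }
        }
        return out.toString();
    }

    // The case statement (step 3), applied recursively to entity replacement text.
    static void step3(String text, Map<String, String> replacements, StringBuilder out) {
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (text.startsWith("&#", i)) {            // character reference
                int end = text.indexOf(';', i);
                out.appendCodePoint(parseCharRef(text.substring(i + 2, end)));
                i = end + 1;
            } else if (c == '&') {                     // entity reference
                int end = text.indexOf(';', i);
                step3(replacements.get(text.substring(i + 1, end)), replacements, out);
                i = end + 1;
            } else if (c == ' ' || c == '\r' || c == '\n' || c == '\t') {
                out.append(' ');                       // literal white space
                i++;
            } else {
                out.append(c);                         // any other character
                i++;
            }
        }
    }

    static String normalize(String rawValue, Map<String, String> literalEntityValues) {
        Map<String, String> replacements = new HashMap<>();
        for (Map.Entry<String, String> e : literalEntityValues.entrySet()) {
            replacements.put(e.getKey(), replacementText(e.getValue()));
        }
        StringBuilder out = new StringBuilder();
        step3(rawValue, replacements, out);
        return out.toString();
    }

    // Makes white space visible, using the same markers as the test programs.
    static String show(String s) {
        return s.replace("\t", "[TAB]").replace(" ", "[SPACE]")
                .replace("\r", "[CR]").replace("\n", "[NL]");
    }

    public static void main(String[] args) {
        Map<String, String> entities = new HashMap<>();
        entities.put("d", "&#xD;");
        entities.put("a", "&#xA;");
        entities.put("t", "&#x9;");
        entities.put("t2", "\t"); // a real tab
        String raw = "CARRIAGE_RETURNS:(&d;&#xD;),NEWLINES:(&a;&#xA;),TABS:(&t;&#x9;&t2;\t)";
        System.out.println(show(normalize(raw, entities)));
    }
}
```

Under these assumptions the sketch prints CARRIAGE_RETURNS:([SPACE][CR]),NEWLINES:([SPACE][NL]),TABS:([SPACE][TAB][SPACE][SPACE]), which matches the spec's example rather than my first reading of the rules.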
My reading of those rules would lead me to think that the output of the XML parser for the attribute value should be as follows (interpretation: the same rules apply whether in attribute or entity - character references preserved, actual characters replaced):
CARRIAGE_RETURNS:([CR][CR]),NEWLINES:([NL][NL]),TABS:([TAB][TAB][SPACE][SPACE])
However, the example given a little bit below that in the spec suggests that the output should be as follows, and a Java test I wrote works in exactly that way (interpretation: if it's an entity value, it's always a replacement):
CARRIAGE_RETURNS:([SPACE][CR]),NEWLINES:([SPACE][NL]),TABS:([SPACE][TAB][SPACE][SPACE])
On the other hand, a test I wrote in PHP outputs this (interpretation: if it's an entity value, it's never a replacement):
CARRIAGE_RETURNS:([CR][CR]),NEWLINES:([NL][NL]),TABS:([TAB][TAB][TAB][SPACE])
Similar output is given by running the xml file through an identity XSLT transform using the xsltproc tool:
<x a="CARRIAGE_RETURNS:(&#13;&#13;),NEWLINES:(&#10;&#10;),TABS:(&#9;&#9;&#9; )"/>
So my question is: what should happen and why?
Sample PHP and Java programs below:
PHP:
// Library versions from phpinfo():
// DOM/XML API Version 20031129
// libxml Version 2.6.32
$doc = new DOMDocument();
$doc->load("t.xml");
echo str_replace(array("\t", " ", "\r", "\n"), array("[TAB]", "[SPACE]", "[CR]", "[NL]"), $doc->documentElement->getAttribute("a")), "\n";
Java:
import java.io.*;

class T {
    public static void main(String[] args) throws Exception {
        String xmlString = readFile(args[0]);
        System.out.println(xmlString);
        org.w3c.dom.Document doc =
            javax.xml.parsers.DocumentBuilderFactory.newInstance().
                newDocumentBuilder().
                parse(new org.xml.sax.InputSource(new StringReader(xmlString)));
        System.out.println(doc.getImplementation());
        System.out.println(
            doc.
                getDocumentElement().
                getAttribute("a").
                replace("\t", "[TAB]").
                replace(" ", "[SPACE]").
                replace("\r", "[CR]").
                replace("\n", "[NL]")
        );
    }

    // Very rough, but works in this case
    private static String readFile(String fileName) throws IOException {
        File file = new File(fileName);
        InputStream inputStream = new FileInputStream(file);
        byte[] buffer = new byte[(int)file.length()];
        int length = inputStream.read(buffer);
        String result = new String(buffer, 0, length);
        inputStream.close();
        return result;
    }
}