Hello,

I am trying to canonicalize an html text node with the com/sun/org/apache/xml/internal/security/c14n/Canonicalizer.java class. My input file has a carriage return and a line feed at the end of lines. Upon canonicalization I expect to see the carriage return transformed into &#xD;. However, the output I get does not contain the carriage return. It only contains the line feed. How should I modify my code to include the carriage return?

example: my input with cr and lf at the end

<MyNode xmlns="http://www.artsince.com/test#">Lqc3EeJlyY45bBm1lha869dkHWw1w+U8A6aKM2Xuwk3yWTjt0A2Wq/25rAncSBQlBGOCyTmhfic9(crlf)
9mWf4mC2Ui6ccLqCMjFR4mDQApkfoTy+Cu2eHul9CRjKa0TqckFv7ryda9V5MHruueXII/V+gPLT(crlf)
c76LsetK8C1434K66+Q=</MyNode>

this is the sample code I use

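import java.io.File;
import java.io.FileInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import com.sun.org.apache.xml.internal.security.Init;
import com.sun.org.apache.xml.internal.security.c14n.Canonicalizer;

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new FileInputStream(new File("C:\\text.xml")));

if(!Init.isInitialized())
{
   Init.init();
}

// select the text node(s) directly under the root element
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "child::*/child::text()";
NodeList textNodeList = (NodeList) xPath.evaluate(expression, doc, XPathConstants.NODESET);
// canonicalize only the selected node set
Canonicalizer cn = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_OMIT_COMMENTS);
byte[] canonn = cn.canonicalizeXPathNodeSet(textNodeList);
System.out.println(new String(canonn).toCharArray());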

and the output I get has only lf in the end

Lqc3EeJlyY45bBm1lha869dkHWw1w+U8A6aKM2Xuwk3yWTjt0A2Wq/25rAncSBQlBGOCyTmhfic9(lf)
9mWf4mC2Ui6ccLqCMjFR4mDQApkfoTy+Cu2eHul9CRjKa0TqckFv7ryda9V5MHruueXII/V+gPLT(lf)
c76LsetK8C1434K66+Q=

however, I expect to see &#xD; and lf at the end of lines

Lqc3EeJlyY45bBm1lha869dkHWw1w+U8A6aKM2Xuwk3yWTjt0A2Wq/25rAncSBQlBGOCyTmhfic9&#xD;(lf)
9mWf4mC2Ui6ccLqCMjFR4mDQApkfoTy+Cu2eHul9CRjKa0TqckFv7ryda9V5MHruueXII/V+gPLT&#xD;(lf)
c76LsetK8C1434K66+Q=
A: 

XML specifies that the input may use any of the common EOL styles, but that the parser must replace each of them with a single linefeed (\n, ASCII 10) character.
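To see this normalization in isolation, here is a small sketch using the JDK's built-in DOM parser. The class and method names (EolNormalizationDemo, rootText) are made up for the illustration; it only compares what the parser returns for a literal CR LF with what it returns for a CR written as a character reference.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EolNormalizationDemo {
    // Parses the given XML string and returns the text content of the root element.
    static String rootText(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Literal CR LF in the input: the parser hands back a single LF.
        String literal = rootText("<a>x\r\ny</a>");
        System.out.println(literal.equals("x\ny"));   // true

        // A CR written as a character reference survives parsing.
        String escaped = rootText("<a>x&#13;\ny</a>");
        System.out.println(escaped.equals("x\r\ny")); // true
    }
}
```

So by the time the canonicalizer sees the text node, a literal CR is already gone, while an escaped one is still there.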

If you want to protect the character, you must replace ASCII 13 with &#13; yourself before the XML parser sees the input. If you use Java, I suggest using a FilterInputStream.
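A minimal sketch of such a filter follows. The class name CrEscapingInputStream is made up for illustration. It assumes the input is UTF-8 or another ASCII-compatible encoding, and that carriage returns occur only inside character data; a CR inside a tag (for example between two attributes) would be turned into a reference where none is allowed and the parse would fail.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: rewrites every CR byte (13) as the five bytes "&#13;". */
public class CrEscapingInputStream extends FilterInputStream {
    private static final byte[] ESCAPED_CR = "&#13;".getBytes(StandardCharsets.US_ASCII);

    // Index of the next replacement byte to hand out; ESCAPED_CR.length means none pending.
    private int pending = ESCAPED_CR.length;

    public CrEscapingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pending < ESCAPED_CR.length) {
            return ESCAPED_CR[pending++];   // still emitting "&#13;"
        }
        int b = super.read();
        if (b == '\r') {
            pending = 1;                    // '&' is returned now, the rest follows
            return ESCAPED_CR[0];
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int count = 0;
        while (count < len) {
            int b = read();                 // reuse the single-byte logic above
            if (b == -1) {
                break;
            }
            buf[off + count++] = (byte) b;
        }
        return count == 0 ? -1 : count;
    }

    @Override
    public boolean markSupported() {
        return false;                       // mark/reset would bypass the pending state
    }

    public static void main(String[] args) throws IOException {
        byte[] input = "<a>x\r\ny</a>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new CrEscapingInputStream(new ByteArrayInputStream(input))) {
            byte[] buf = new byte[8];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        // The CR is now a character reference; the LF is untouched.
        System.out.println(new String(out.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

In the code from the question, the filter would be wrapped around the file stream, as in db.parse(new CrEscapingInputStream(new FileInputStream(file))). The parsed text node then holds a real U+000D, which Canonical XML writes out as &#xD; in text nodes.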

Aaron Digulla
Does that mean it is wrong to expect the cr to be replaced during canonicalization in this case?
artsince
Not only in this case; XML always swallows it even before the text nodes are created.
Aaron Digulla
I'm afraid I had misguided artsince earlier about the way that c14n is meant to preserve explicit &#xD; but normalise literal U+000D characters, as this had made a difficulty with the equivalent .NET code appear correct to me when it was not. In .NET one wants to do the normalisation prior to loading the XmlDocument to have the correct cases of &#xD; preserved, as otherwise they won't be distinguished from explicit cases. Easily done, but the fact that &#xD; often is correct in c14n output had misled me.
Jon Hanna