I'm using a DocumentBuilder to parse XML files. However, the specification for the project requires that within text nodes, strings like &quot; and &lt; be returned literally, and not decoded as characters (" and <).

A previous similar question, http://stackoverflow.com/questions/1979785/read-escaped-quote-as-escaped-quote-from-xml, received one answer that seems to be specific to Apache, and another that appears to simply not do what it says it does. I'd love to be proven wrong on either count, however :)

For reference, here is some code:

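  // Needs java.io.File, javax.xml.parsers.* and org.w3c.dom.*
  File file = new File(fileName);
  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
  DocumentBuilder builder = factory.newDocumentBuilder();
  Document doc = builder.parse(file);

  // Take the first element with the requested tag name
  NodeList textElements = doc.getElementsByTagName(text);
  Element textElement = (Element) textElements.item(0);

  // Read the value of its first child node; escapes are already decoded here
  NodeList children = textElement.getChildNodes();
  String txt = children.item(0).getNodeValue();
  System.out.println(txt);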

I would like that println() to produce things like

&quot;3&gt;2&quot;

instead of

"3>2"

which is what currently happens. Thanks!

+1  A: 

One approach might be to try dom4j, and to use the Node.asXML() method. It might return a deep structure, so it might need cloning to get just the node or text you want without any of its children.

John
+2  A: 

You can turn them back into xml-encoded form by

 StringEscapeUtils.escapeXml(str);

(javadoc, commons-lang)
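
If pulling in commons-lang feels too heavy for a small project, the same idea can be sketched by hand. The example below is only an illustration: the class name `ReEscapeDemo` and the helper `escapeXml` are made up for this sketch and are not the commons-lang implementation. It parses with a plain DocumentBuilder and then re-encodes the five predefined XML entities.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ReEscapeDemo {
    // Hypothetical helper (not the commons-lang method): re-encodes the
    // five predefined XML entities in an already-decoded string.
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<text>&quot;3&gt;2&quot;</text>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // The parser has already decoded the escapes at this point
        String decoded = doc.getDocumentElement().getFirstChild().getNodeValue();
        System.out.println(decoded);            // prints "3>2"
        System.out.println(escapeXml(decoded)); // prints &quot;3&gt;2&quot;
    }
}
```

Note that this re-escapes every such character, so a quote that was written unescaped in the original file also comes back as &quot;. It reproduces an escaped form, not necessarily the exact source text.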

Bozho
A: 

Both good answers, but both a little too heavy-weight for this very small-scale application. I ended up going with the total hack of just stripping out all &s (I do this to &s that aren't part of escapes later anyway). It's ugly, but it's working.

Edit: I understand there's all kinds of things wrong with this, and that the requirement is stupid. It's for a school project; all that matters is that it works in one case, and the requirement is not my fault :)
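
For anyone who wants the same kind of pre-processing hack but needs to keep the escapes intact, a variant is to turn every & in the raw text into &amp; before parsing instead of stripping it. This is a rough sketch, not what I actually ran: the class name `PreEscapeDemo` and the helper `parseKeepingEntities` are invented for the example, and it assumes the XML is available as a String.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class PreEscapeDemo {
    // Hypothetical helper: escapes every ampersand in the raw XML so that
    // the parser decodes "&amp;quot;" back to the literal text "&quot;".
    static String parseKeepingEntities(String rawXml) throws Exception {
        String preEscaped = rawXml.replace("&", "&amp;");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(preEscaped)));
        // Text content of the root element, with the original escapes preserved
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // prints &quot;3&gt;2&quot;
        System.out.println(parseKeepingEntities("<text>&quot;3&gt;2&quot;</text>"));
    }
}
```

It is still a hack: the replacement also touches attribute values and CDATA sections, so it is only reasonable when the input is as simple as it is here.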

Personman
It will stop working at some point and you will wonder where the bug came from ;)
Bozho
+2  A: 

I'm using a DocumentBuilder to parse XML files. However, the specification for the project requires that within text nodes, strings like &quot; and &lt; be returned literally, and not decoded as characters (" and <).

Bad requirement. Don't do that.

Or at least consider carefully why you think you want or need it.

CDATA sections and escapes are a tactic for allowing you to pass text like quotes and '<' characters through XML and not have XML confuse them with markup. They have no meaning in themselves and when you pull them out of the XML, you should accept them as the quotes and '<' characters they were intended to represent.
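
To make that concrete, here is a small sketch showing that an escape and a CDATA section are two spellings of the same text as far as the parser is concerned. The class name `CdataDemo` and the helper `textOf` are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CdataDemo {
    // Hypothetical helper: parses a small XML string and returns the
    // text content of its root element.
    static String textOf(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String viaEscape = textOf("<t>3 &lt; 4</t>");
        String viaCdata  = textOf("<t><![CDATA[3 < 4]]></t>");

        System.out.println(viaEscape);                  // prints 3 < 4
        System.out.println(viaCdata);                   // prints 3 < 4
        System.out.println(viaEscape.equals(viaCdata)); // prints true
    }
}
```

Both documents carry the same data; which encoding the author used is not something the application is normally meant to see.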

Don Roby