I'm using dom4j to parse my XML. Let's say I have something like this:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>&#402;</bar>
</foo>

When looking at the value of the "bar" node, it gives me back the special character represented by "&#402;".

Is there a way to prevent this and just read in the actual bit of text?

+1  A: 

The actual bit of text being &#402;? Then you need to escape the ampersand as &amp; in the document.

ChssPly76
digiarnie
Well, there's a difference between reading and writing. For writing you can call setEscapeText(false) on org.dom4j.io.XMLWriter to write whatever you have verbatim. If you do that, keep in mind that a read/write cycle will change the document, so you have to be careful.
ChssPly76
+2  A: 

If the value of the bar node were to contain a literal < or an & on its own, it would break the parser (a > is usually tolerated, but is best escaped as well). To protect against this you should escape all data on the way in; it is then unescaped on the way out again when the document is parsed.

This turns your document into:

<?xml version="1.0" encoding="UTF-8"?>
<foo>
    <bar>&amp;#402;</bar>
</foo>

It does suck, but that's XML for you.
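To see the difference between the two documents, here is a minimal sketch. It uses the JDK's built-in DOM parser rather than dom4j so that it runs without extra dependencies; the entity handling comes from XML itself, so dom4j's Element.getText() should behave the same way. The class name EntityDemo and the helper barText are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityDemo {
    // Hypothetical helper: parse the XML string and return the text
    // content of the first <bar> element.
    public static String barText(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getElementsByTagName("bar").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Character reference: the parser resolves it to one character.
        String raw = "<foo><bar>&#402;</bar></foo>";
        // Escaped ampersand: the parser hands back the literal text.
        String escaped = "<foo><bar>&amp;#402;</bar></foo>";

        System.out.println((int) barText(raw).charAt(0)); // prints 402
        System.out.println(barText(escaped));             // prints &#402;
    }
}
```

The first document yields a single character with code point 402; only the second, with the ampersand escaped, yields the six characters &#402; as text.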

banjollity
+1 for the final XML comment
digiarnie