Hi, I have an XML file that is transformed with XSL. Some elements have to be changed and some have to be left as is. Specifically, text containing the entities &quot;, &amp;, &apos;, &lt; and &gt; should be left as is, but in my case &quot; and &apos; are changed to " and ' respectively.

Test XML:

<?xml version="1.0" encoding="UTF-8" ?>
<root>
    <element>
     &quot;
     &amp;
     &apos;
     &lt;
     &gt;
    </element>
</root>

transformation file:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="no" indent="no" />
    <xsl:template match="element">
     <xsl:copy>
      <xsl:value-of disable-output-escaping="no" select="." />
     </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

result:

<?xml version="1.0" encoding="UTF-8"?>
    <element>
     "
     &amp;
     '
     &lt;
     &gt;
    </element>

desired result:

<?xml version="1.0" encoding="UTF-8"?>
    <element>
        &quot;
        &amp;
        &apos;
        &lt;
        &gt;
    </element>

I have 2 questions:

  • why are some of those entities transformed and others not?
  • how can I get the desired result?

Thank you.

+1  A: 

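The reason is that < and & must always be escaped in XML content, and > is escaped by convention as well (strictly, it has to be escaped only in the sequence ]]>). These characters have special meaning in XML, so they must be treated specially when they are part of the data rather than the markup.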

The other two, ' and ", can be escaped; their entity names are known to XML, mainly so that attribute values work correctly, like this:

<xml ackbar="He said, &quot;It's a trap!&quot;" />
<xml ackbar='He said, "It&apos;s a trap!"' />

In all places where their escaping is not absolutely necessary, they can occur literally.
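
The same rule can be seen in a typical serializer. The sketch below uses Python's standard `xml.sax.saxutils` module purely as an illustration (it is not the XSLT processor from the question, and the sample strings are made up): by default only the characters that are unsafe in text are escaped, quotes are escaped only where they would end an attribute value, and escaping them in text is possible but optional.

```python
from xml.sax.saxutils import escape, quoteattr

text = "\" & ' < >"

# Default text escaping only touches &, < and >; quotes are left alone.
escaped_text = escape(text)
print(escaped_text)

# In an attribute value, the quote used as delimiter has to be escaped.
attr = quoteattr('He said, "It\'s a trap!"')
print(attr)

# Quotes can still be escaped in text on request; it is simply not required.
escaped_all = escape(text, {'"': "&quot;", "'": "&apos;"})
print(escaped_all)
```

The last call shows that a serializer can be asked to emit &quot; and &apos; in text, but whether an XSLT 1.0 processor offers such a switch depends on the processor.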

The resulting info set (e.g. in form of a DOM) will be exactly the same, and you should not care too much whether they occur literally or as an entity in the XML file.

In fact, all of your data could occur in escaped form (as numeric character references, such as &#10;) without changing the actual document; only the serialized representation differs.
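
A quick way to convince yourself of this is to parse several serializations and compare what the parser hands back. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the three sample documents are assumptions made up for the illustration, modelled on the test XML in the question.

```python
import xml.etree.ElementTree as ET

# Three serializations of the same element content.
with_entities = "<element>&quot; &amp; &apos; &lt; &gt;</element>"
with_literals = "<element>\" &amp; ' &lt; &gt;</element>"
with_numeric = "<element>&#34; &#38; &#39; &#60; &#62;</element>"

docs = (with_entities, with_literals, with_numeric)
texts = [ET.fromstring(doc).text for doc in docs]

# After parsing, all three yield the same character data.
same = texts[0] == texts[1] == texts[2]
print(same, repr(texts[0]))
```

An XML-aware consumer of the transformed file sees identical text in all three cases, which is why the choice between " and &quot; in element content is usually left to the serializer.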

As long as you work with XML-aware tools (e.g. DOM parsers), you will never notice a difference. Corollary: If you don't work with XML-aware tools (e.g. regex or string manipulation), you should stop that immediately. ;-)

Tomalak
Please review the "desired result:" part of my question - it was not rendered correctly, so I edited it.
tori3852
It does not affect my answer, though. ;-) I can be more explicit: You probably can't get your desired result, and in any case - you should not *care* how single or double quotes are rendered in the XML file.
Tomalak
Seems there are no other opinions and this answer is pretty informative, so I'll accept it. Thank you.
tori3852
A: 

You can always escape the original ampersand; in essence it would look something like this:

&amp;quot;
Mark E
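
One caveat may be worth checking before relying on this: escaping the ampersand changes the data itself, not only how it is written. The sketch below, again using Python's standard `xml.etree.ElementTree` as an illustrative parser with made-up one-element documents, compares what a parser reads from each form.

```python
import xml.etree.ElementTree as ET

# The original entity is read as a single quote character.
original = ET.fromstring("<element>&quot;</element>").text

# The double-escaped form is read as the six characters &quot;
double_escaped = ET.fromstring("<element>&amp;quot;</element>").text

print(repr(original))
print(repr(double_escaped))
```

So any tool that later parses the file would see the literal text &quot; rather than a quotation mark, which may or may not be what is wanted.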