ansaurus

Question

Answer 1

+2 A:

Since the film element in the CDATA block appears to be well-formed, you could use disable-output-escaping (DOE). Match on name/text(), output it with value-of and DOE enabled, then insert the Language element immediately after it.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"  />

<!--Identity template simply copies content forward -->
<xsl:template match="@*|node()">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>


<xsl:template match="name/text()">
    <!--disable-output-escaping will prevent the "film" element from being escaped.
    Since it appears to be well-formed you should be safe, but no guarantees -->
    <xsl:value-of select="." disable-output-escaping="yes" />
    <Language>English</Language>
</xsl:template>

</xsl:stylesheet>

Mads Hansen 2010-09-14 11:03:47

+1 for posting what I was thinking.

Per T 2010-09-14 11:16:48

+1 If DOE is possible and there is strong certainty that the CDATA is well-formed

Alejandro 2010-09-14 16:11:42

Answer 2

+1 A:

Another way to solve this, which would give you more control over the transformation, is to use Andrew Welch's LexEv XMLReader. Among other things, it lets you process CDATA sections as markup.

Per T 2010-09-14 11:15:32

+1 Interesting solution. Note @Madhu that this is not an XSLT solution but works by supplying a different XML parser to the XSLT processor. May require a Java XSLT processor. If you have enough control over your XSLT environment to use this, it will take care of your parsing problems in a very complete way.

LarsH 2010-09-14 11:21:53

Answer 3

+3 A:

First, the fact that your input XML has "CDATA" is in one sense irrelevant... the XSLT can't tell whether it's CDATA or not. What's key about your input XML is that you have escaped markup <film>...</film>, and you want to turn it into a real element.

If you know that the escaped element will always have a certain name ('film'), and you know where it occurs, you can strip it and replace it easily:

   <xsl:template match="text()[contains(., '&lt;film>')]">
      <film>
         <xsl:value-of select="substring-before(substring-after(., '&lt;film>'),
              '&lt;/film>')"/>
      </film>
   </xsl:template>
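
To see that template in context, here is a minimal, self-contained sketch that pairs it with the identity template from Answer 1. The input document shown in the comment is hypothetical (the question's actual XML is not reproduced here); only the two templates come from the answers.

```xml
<?xml version="1.0"?>
<!-- Hypothetical input, assumed for illustration only:
       <movies>
         <name><![CDATA[<film>Aliens</film>]]></name>
       </movies>
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copies everything not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Any text node containing the escaped start tag is replaced by a
       real film element holding the text between the two escaped tags -->
  <xsl:template match="text()[contains(., '&lt;film>')]">
    <film>
      <xsl:value-of select="substring-before(substring-after(., '&lt;film>'),
                                             '&lt;/film>')"/>
    </film>
  </xsl:template>

</xsl:stylesheet>
```

With the assumed input, name should end up containing a real film element whose text is Aliens. Note that any text before or after the escaped film element in the same text node is dropped, and markup nested inside film would come through as plain text.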

If you don't know in advance where the escaped tags will occur and what the element names are, you could use XSLT 2.0's <xsl:analyze-string> to find and replace them. But as Alejandro pointed out, general parsing of XML using regular expressions can get very messy. It would only be feasible if you know the markup will be simple.
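
As a rough illustration of the analyze-string idea, the sketch below handles only the simplest case: escaped elements with a plain name, no attributes, and text-only content. The regular expression and the limits it imposes are my assumptions, not something from the question, and it needs an XSLT 2.0 processor.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copies everything not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only text nodes that look as if they contain a start tag -->
  <xsl:template match="text()[matches(., '&lt;[A-Za-z]')]">
    <!-- Regex, unescaped: <(name)>(content)</\1>
         group 1 = element name, group 2 = content with no '<' in it,
         \1 = back-reference requiring the same name in the end tag -->
    <xsl:analyze-string select="."
        regex="&lt;([A-Za-z][A-Za-z0-9]*)&gt;([^&lt;]*)&lt;/\1&gt;">
      <xsl:matching-substring>
        <!-- Rebuild the escaped element as a real element -->
        <xsl:element name="{regex-group(1)}">
          <xsl:value-of select="regex-group(2)"/>
        </xsl:element>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Everything else stays text -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Anything the pattern does not cover (attributes, namespaces, nested elements, comments, entity references) is left as text, which is exactly why this approach only suits simple markup.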

LarsH 2010-09-14 11:16:33

+1 a little more exact, in case there are multiple `name/text()`. Good defensive coding

Mads Hansen 2010-09-14 11:27:03

rather you can add <xsl:value-of disable-output-escaping="yes" select="substring-after(.,'<?xml version="1.0" encoding="utf-8"?>')" />

Madhu CM 2010-09-14 12:03:36

Dimitre Novatchev 2010-09-14 13:06:33

@Madhu No, that won't work because the XSLT doesn't see `<?xml version=...>`. It's not part of the source document tree. Even if it were, taking value-of `.` (which I assume to be `/`) would lose all the elements of the document: their tags would be absent from the output. Also, a big reason for the above is to avoid disable-output-escaping, which is a kludge that is usually avoidable if you treat markup as markup and text as text. XSLT processors aren't even required to honor d-o-e. In some environments, they can't.

LarsH 2010-09-14 13:56:47

@LarsH: I think you should test for `contains(.,'&lt;film>')`. Also, I don't think it is good practice to recommend parsing XML with RegExp...

Alejandro 2010-09-14 14:37:22

@Alejandro: thanks for catching the typo... I fixed it. Good point that parsing XML with regexp is not a good idea in general. Updated my answer accordingly. But don't you think it can be a lesser evil than d-o-e, especially if the markup is not too complex? and especially in situations where d-o-e won't work at all. One more tool in the toolchest, more of a last resort than a primary tool.

LarsH 2010-09-14 15:35:24

@LarsH: First, +1 for a good answer for this specific case. I think that DOE and using RegExp for parsing are equally "evil". Ja!

Alejandro 2010-09-14 16:08:07

extract cdata using xslt