See update at end.
1 is the most important question. It's possible to do; the question is whether you'd have to write an XML parser by hand in XSLT, use an extension function, or whether there's a convenient, portable solution. Update: if you can use Saxon's saxon:parse() extension function, that's by far your best bet. Do you have access to it?
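A minimal sketch of the saxon:parse() route, assuming a Saxon version and edition that still provides the function (it lives in the `http://saxon.sf.net/` namespace). The element name `fred` is borrowed from the question's own example quoted further down; the rest is illustrative, not a tested drop-in:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <!-- Copy everything else through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: "fred" holds a string that is itself a well-formed
       XML document (exactly one root element). -->
  <xsl:template match="fred">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- saxon:parse() turns the string into a document node;
           copy-of then copies that document's children. -->
      <xsl:copy-of select="saxon:parse(string(.))"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

If the string is a fragment rather than a whole document, parsing will fail, so you may need to wrap it in a dummy root element first.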
2 is easy to answer: yes, use the identity transform. This will not preserve all lexical details of the input XML, such as the order of attributes, or whether `<foo/>` is written as `<foo></foo>`. However, it will preserve all details that are supposed to matter to XML processors.
But this won't help you if you can't run 2 stylesheets in a pipeline, right?
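For reference, the identity transform mentioned in point 2 is usually written like this in XSLT 1.0:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Match every attribute and every node, copy it, and recurse. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Any more specific template you add will override it for the nodes that template matches, which is what makes it a convenient base for "copy everything except X".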
3: Hmm... not robustly. If your output is going to be displayed by a browser, or handled by something else that understands an XML stylesheet processing instruction, you could output one of those, and hope (against the spec's recommendation!) that serialization and parsing would occur in between this stylesheet and the one you associated on output. But this would be very fragile. I say "against the spec's recommendation" because here it says,
When this or any other mechanism
yields a sequence of more than one
XSLT stylesheet to be applied
simultaneously to a XML document, then
the effect should be the same as
applying a single stylesheet that
imports each member of the sequence in
order
which would imply, without serialization and parsing in between. Not recommended.
Update: a new comment says that you don't know in advance which elements will contain CDATA sections. I jumped to the conclusion that this meant you didn't know which elements would contain unparsed data (since XML processors officially don't know or care what elements are in CDATA sections, per se). In that case, all bets are off. As you may know, XML processors are not supposed to know which parts of an XML input doc are in CDATA sections. CDATA is just a different way of escaping markup, an alternative to `&lt;` etc. Once the data is parsed (which is not properly under the XSLT processor's jurisdiction), you can't tell how it was initially expressed in markup. A left pointy bracket remains a left pointy bracket whether it's expressed as `<![CDATA[ < ]]>` or `&lt;`. Just as in C, it doesn't matter whether you specify a character as 'A' or 65 or 0x41; once the program is compiled, your code won't be able to tell the difference.
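To see this for yourself, here is a small sketch. The input document and its element names (`doc`, `a`, `b`) are made up for illustration:

```xml
<doc>
  <a><![CDATA[x < y]]></a>
  <b>x &lt; y</b>
</doc>
```

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- By the time XSLT sees the tree, both elements contain the
       same text node, so the comparison is true. -->
  <xsl:template match="/doc">
    <xsl:choose>
      <xsl:when test="string(a) = string(b)">same</xsl:when>
      <xsl:otherwise>different</xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

This outputs `same`: nothing in the XPath data model records which of the two was written with CDATA.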
Therefore, if you don't have another way of determining which data in your input document needs to be parsed, then none of the above methods will help you: you can't know where to apply saxon:parse(), nor manual parsing, nor disable-output-escaping with a following XSLT transformation.
Workarounds:
You could guess, e.g. with `test="contains(., '&lt;')"`, which nodes contain unparsed data. (Note this tests for the left pointy bracket, regardless of whether it's expressed as a character entity, or part of a CDATA section, or any other way.) You'd sometimes get false positives, e.g. if a text node contained the string "year < 2001". Or you could attempt to parse every text node (very inefficient), and for those that parse successfully as well-formed XML documents, output the tree instead of the text.
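A rough sketch of the guessing approach, written as a template to drop into an identity-transform stylesheet. It assumes the output gets serialized and re-parsed afterwards, and that your processor honors disable-output-escaping (support is optional):

```xml
<!-- Heuristic only: treat any text node containing "<" as unparsed
     markup and write it out unescaped. Note the bracket must be
     written as &lt; inside the match attribute. -->
<xsl:template match="text()[contains(., '&lt;')]"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:value-of select="." disable-output-escaping="yes"/>
</xsl:template>
```

The false-positive case is exactly where this bites: a text node like "year < 2001" would be written with a raw `<`, and the serialized result would no longer be well-formed.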
Or you could preprocess the XML with a non-XML tool (like LexEv), which therefore can "see" the CDATA markup. But you've said that you can't control anything outside the single XSLT.
Or, ideally, you could send the message back up the chain that the XML you're being given is unworkable: they need to flag somehow, other than by using CDATA markup, which sections contain unparsed data. Usually this would be done either by specifying certain element names, or by using attribute flags. Obviously this would depend on who's supplying the XML.
Another update
OK, now I understand: so you know which element contains unparsed data (and you know it's marked up with CDATA), but you don't know which other data might be marked up with CDATA.
the idea was to transform [i.e. parse -Lars] the known
CDATA node ("fred") into XML nodes
while leaving the whole of the rest
of the document as original input,
so that it could then be piped through
the "general" transformation
For this purpose, "leaving the whole of the rest of the document as original input" does not need to mean preserving any CDATA markup. (The general transformation downstream will not know or care what data is CDATA-escaped.) All that is required is that the one unparsed node gets parsed and the rest does not. The identity transform will do the latter just fine; you can ignore what that page says about CDATA sections on the output... the downstream XSLT will not know or care. (Unless you have additional (non-XML) requirements for the output that you haven't told us about.)
So if you could do a two-stylesheet transform, with serialization and parsing in between (i.e. not in a traditional SAX pipeline, for example), then the identity transform would work: you'd just need an additional template for the known unparsed node, with disable-output-escaping, as in Tomalak's answer here.
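A sketch of what that first-stage stylesheet might look like, assuming the known unparsed element is named `fred` (as in the question) and that its output is serialized to text before the second stylesheet reads it:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity transform: copy everything by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- The one known unparsed element: emit its text without escaping,
       so the serialized output contains real markup. -->
  <xsl:template match="fred">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="." disable-output-escaping="yes"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The markup only becomes real nodes when the serialized result is parsed again, which is why this does nothing useful if the two stylesheets are chained on an in-memory tree.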
But if you can't do a two-step transform... what XSLT processor are you using? There may be other avenues specific to it.