In the middle of an XML document I'm transforming, there is a CDATA node which I know itself is composed of XML. I would like to have that "recursively parsed" as XML so that I can transform it too. Upon searching, I think my question is very similar to http://stackoverflow.com/questions/1927522/handling-node-with-inner-xml-in-xslt.

That was a year ago: may I just clarify the following:

  1. It says this cannot be done by a single XSLT in one go: rather, you need a two-phase approach. I have just bought a shiny new book on XSLT 2.0. Is it still the case that there is no XSLT instruction to "re-parse" a string node as XML?
  2. In my case the XML-string node is just one node in the whole. Therefore in Phase #1 I would only be transforming a fragment of the input XML document; the rest needs passing through unchanged to Phase #2. I see several solutions for passing input to output unchanged, but often they seem to "mostly work" while skipping or mishandling some kinds of input node. Is there a reliable construct for passing the rest of the input to the output without any changes?
  3. That approach relies on me being able to apply 2 transforms separately. I am limited (existing application) to only being allowed one transform (the XML output is fixed; it is transformed by one XSLT file; the only thing I can do is put whatever I like into that XSLT file, and/or add further XSLT files, but I cannot influence the top-level call to pass the XML through one XSLT file). Is there anything I could put into an XSLT file which could cause the second XSLT transform to be invoked?
A: 

See update at end.

1 is the most important question. It can be done; the question is whether you'd have to write an XML parser by hand in XSLT, use an extension function, or whether there's a convenient, portable solution. Update: If you can use Saxon's parse() extension function, that's by far your best bet. Do you have access to that?
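A minimal sketch of the saxon:parse() route, under several assumptions: the processor is a Saxon release that ships this extension function (it is not part of standard XSLT), the unparsed text lives in an element named fred (the name the asker uses in the comments), and that text is by itself a well-formed XML document.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="saxon">

  <!-- Pass everything else through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- "fred" is an assumed element name. Its string value is parsed into a
       temporary document, and templates are then applied to the parsed nodes. -->
  <xsl:template match="fred">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="saxon:parse(string(.))/node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the parsed nodes go back through xsl:apply-templates, any other templates in the same stylesheet would see them as ordinary elements, which is what makes this a single-transform approach.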

2 is easy to answer: yes, use the identity transform. This will not preserve all lexical details of the input XML, such as order of attributes, or whether <foo/> is written as <foo></foo>. However it will preserve all details that are supposed to matter to XML processors.
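For reference, the identity transform is usually written along these lines (a standard XSLT 1.0 idiom, shown here as a standalone stylesheet):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy every attribute and every child node (elements, text, comments,
       processing instructions), then recurse into attributes and children. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Any more specific template added alongside it overrides the copy for just the nodes it matches, so the rest of the document still passes through.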

But this won't help you if you can't run 2 stylesheets in a pipeline, right?

3: Hmm... not robustly. If your output is going to be displayed by a browser, or handled by something else that understands an XML stylesheet processing instruction, you could output one of those, and hope (against the spec's recommendation!) that serialization and parsing would occur in between this stylesheet and the one you associated on output. But this would be very fragile. I say "against the spec's recommendation" because here it says,

When this or any other mechanism yields a sequence of more than one XSLT stylesheet to be applied simultaneously to a XML document, then the effect should be the same as applying a single stylesheet that imports each member of the sequence in order

which would imply, without serialization and parsing in between. Not recommended.

Update: a new comment says that you don't know in advance which elements will contain CDATA sections. I jumped to the conclusion that this meant you didn't know which elements would contain unparsed data (since XML processors officially don't know or care what elements are in CDATA sections, per se). In that case, all bets are off. As you may know, XML processors are not supposed to know which parts of an XML input doc are in CDATA sections. CDATA is just a different way of escaping markup, an alternative to &lt; etc. Once the data is parsed (which is not properly under the XSLT processor's jurisdiction), you can't tell how it was initially expressed in markup. A left pointy bracket remains a left pointy bracket whether it's expressed as <![CDATA[ < ]]> or &lt;. Just as in C, it doesn't matter whether you specify a character as 'A' or 65 or 0x41; once the program is compiled, your code won't be able to tell the difference.

Therefore, if you don't have another way of determining which data in your input document needs to be parsed, then none of the above methods will help you: you can't know where to apply saxon:parse(), nor manual parsing, nor disable-output-escaping with a following XSLT transformation.
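To illustrate the point about CDATA being invisible after parsing, here is a small sketch. The samples and note element names are made up for the example; the input is shown in the comment.

```xml
<!-- Hypothetical input document:

     <samples>
       <note><![CDATA[year < 2001]]></note>
       <note>year &lt; 2001</note>
     </samples>
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Compare the string values of the two note elements. -->
  <xsl:template match="/samples">
    <xsl:value-of select="note[1] = note[2]"/>
  </xsl:template>

</xsl:stylesheet>
```

This outputs true: both note elements carry the same text node, and nothing in the XPath data model records which one was written with CDATA markup.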

Workarounds:

  • You could guess, e.g. with test="contains(., '&lt;')", which nodes contain unparsed data. (Note this tests for the left pointy bracket, regardless of whether it's expressed as a character entity, or part of a CDATA section, or any other way.) You'd sometimes get false positives, e.g. if a text node contained the string "year < 2001". Or you could attempt to parse every text node (very inefficient), and for those that parse successfully as well-formed XML documents, output the tree instead of the text.

  • Or you could preprocess the XML with a non-XML tool (like LexEv), which therefore can "see" the CDATA markup. But you've said that you can't control anything outside the single XSLT.

  • Or, ideally, you could send the message back up the chain that the XML you're being given is unworkable: they need to flag somehow, other than by using CDATA markup, which sections contain unparsed data. Usually this would be done either by specifying certain element names, or by using attribute flags. Obviously this would depend on who's supplying the XML.

Another update: OK, now I understand: so you know which element contains unparsed data (and you know it's marked up with CDATA), but you don't know which other data might be marked up with CDATA.

the idea was to transform [i.e. parse -Lars] the known CDATA node ("fred") into XML nodes while leaving the whole of the rest of the document as original input, so that it could then be piped through the "general" transformation

For this purpose, "leaving the whole of the rest of the document as original input" does not need to mean preserving any CDATA markup. (The general transformation downstream will not know or care what data is CDATA-escaped.) All that is required is that the one unparsed node get parsed, and that the rest not get parsed. The identity transform will do the latter just fine; you can ignore what that page says about CDATA sections on the output, because the downstream XSLT will not know or care (unless you have additional, non-XML requirements for the output that you haven't told us about).

So if you could do a two-stylesheet transform, with serialization and parsing in between (i.e. not in a traditional SAX pipeline, for example), then the identity transform would work: you'd just need an additional template for the known unparsed node, with disable-output-escaping, as in Tomalak's answer here.
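A sketch of what that first-stage stylesheet might look like, assuming the known element is named fred (the asker's placeholder name) and that the output really is serialized and then re-parsed before the second stylesheet runs:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity transform: pass everything through unchanged by default. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text inside "fred" (assumed name) is written without escaping, so the
       serialized output contains real markup for the next stage to parse. -->
  <xsl:template match="fred/text()">
    <xsl:value-of select="." disable-output-escaping="yes"/>
  </xsl:template>

</xsl:stylesheet>
```

Note that disable-output-escaping is an optional feature that only takes effect at serialization, which is why this does nothing useful when the result tree is handed directly to the next transform. If the embedded text begins with its own XML declaration, the serialized result would not be well-formed, so that case would need separate handling.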

But if you can't do a two-step transform... what XSLT processor are you using? There may be other avenues specific to it.

LarsH
Thank you for trying! But I think I fall foul of all 3 suggestions!
JonBrave
1. Nope, I'm a Norman, no Saxon possible! I would have needed something in vanilla XSLT.
JonBrave
2. Just as per what I said, the "identity transform" you show is insufficient for "no change" on unknown input. If I understand correctly, I have to know what CDATA sections there will be for it to work robustly, and I don't.
JonBrave
3. As you say, it looks like there isn't a sufficiently reliable non-pipelined approach available.
JonBrave
@JonBrave, re #2, the identity transform is sufficient for "no change" in regard to the XML information model; in particular it will preserve the information in the CDATA sections but not the markup. But I think you're saying you *do* want to change the data (in the CDATA sections) from unparsed to parsed, and you're right, no transform will allow for that if you don't have an independent way of knowing which sections are unparsed. Neither saxon:parse() nor a two-stage approach will help you in that case. Editing answer...
LarsH
@JonBrave: my answer is edited to reflect the fact that you don't have a way of knowing which text nodes contain data that needs parsing. Are you sure you can't determine that from context?
LarsH
Oh dear. My XML will have one node, whose name I do know ("fred", say), which will be a CDATA and will hold XML which needs re-parsing. The rest of the document could be any old XML, CDATAs or not, and does not need re-parsing. In the 2 stage approach suggested, the idea was to transform the known CDATA node ("fred") into XML nodes while leaving *the whole of the rest of the document as original input*, so that it could then be piped through the "general" transformation. The problem is, the "identity transform" link I read says/implies you must know the names of any CDATAs in advance.
JonBrave
...which I don't (I only know about the one CDATA I want to re-parse, I have no idea what else might or might not be in the XML document)...
JonBrave
@Jon, ok, sorry, I misunderstood. Comment space is too short, so I'll edit my answer.
LarsH
@Jon, edited my answer. What XSLT processor are you using?
LarsH
JonBrave
@JonBrave, I meant, what specific processor, e.g. Saxon, MSXML, libxslt, etc. So that we might consider non-portable solutions. But if you need to move on, that's fine.
LarsH