I've got wads of autogenerated HTML doing stupid things like this:

 <p>Hey it's <em>italic</em><em>italic</em>!</p>

And I'd like to mash that down to:

 <p>Hey it's <em>italicitalic</em>!</p>

My first attempt was along these lines...

<xsl:template match="em/preceding::em">
    <xsl:value-of select="$OPEN_EM"/>
    <xsl:apply-templates/>
</xsl:template>

<xsl:template match="em/following::em">
    <xsl:apply-templates/>
    <xsl:value-of select="$CLOSE_EM"/>
</xsl:template>

But apparently the XSLT spec, in its grandmotherly kindness, forbids the standard XPath preceding and following axes in template match patterns. (And that would need some tweaking to handle three ems in a row anyway.)

Any solutions better than forgetting about doing this in XSLT and just running a replace('</em><em>', '') in $LANGUAGE_OF_CHOICE on the end result? Rough requirements: should not combine two <em> if they are separated by anything (whitespace, text, tags), and while it doesn't have to merge them, it should at least produce valid XML if there are three or more <em> in a row. Handling tags nested within the ems (including other ems) is not required.
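For comparison, the string-replace fallback could be sketched in Python roughly as follows. The function name merge_adjacent_em is made up for illustration, and the sketch assumes the generated markup never puts attributes on these em tags and always serializes the seam exactly as </em><em>.

```python
def merge_adjacent_em(html: str) -> str:
    """Drop every '</em><em>' seam so directly adjacent ems become one."""
    # Plain substring replacement: anything between the tags (whitespace,
    # text, other tags) breaks the match, so those ems are left alone.
    return html.replace("</em><em>", "")


print(merge_adjacent_em("<p>Hey it's <em>italic</em><em>italic</em>!</p>"))
# prints: <p>Hey it's <em>italicitalic</em>!</p>
```

Because each seam is removed independently, three or more ems in a row collapse into a single em and the result stays well-formed.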

(And oh, I've seen http://stackoverflow.com/questions/1542775/how-to-merge-element-using-xslt, which is similar but not quite the same. XSLT 2 is regrettably not an option and the proposed solutions look hideously complex.)

+2  A: 

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

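 <!-- Index every em that directly follows another em, keyed by the id of
      its nearest preceding sibling that is not an em -->
 <xsl:key name="kFollowing"
  match="em[preceding-sibling::node()[1][self::em]]"
  use="generate-id(preceding-sibling::node()[not(self::em)][1])"/>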

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- First em of a run: directly followed by an em, not directly preceded by one -->
 <xsl:template match=
    "em[following-sibling::node()[1][self::em]
      and
        not(preceding-sibling::node()[1][self::em])
       ]">
   <em>
     <xsl:apply-templates select=
     "node()
     |
      key('kFollowing',
           generate-id(preceding-sibling::node()[1])
          )/node()"/>
   </em>
 </xsl:template>
 <!-- Suppress the ems whose content was already pulled into the first em of their run -->
 <xsl:template match=
 "em[preceding-sibling::node()[1][self::em]]"/>
</xsl:stylesheet>

when applied to the following XML document (based on the provided document, but with three adjacent em elements):

<p>Hey it's <em>italic1</em><em>italic2</em><em>italic3</em>!</p>

produces the wanted, correct result:

<p>Hey it's <em>italic1italic2italic3</em>!</p>

Do note:

  1. The use of the identity rule to copy every node as is.

  2. The use of a key in order to specify conveniently the following adjacent em elements.

  3. The overriding of the identity transform only for em elements that have adjacent em elements.

  4. This transformation merges any number of adjacent em elements.

Dimitre Novatchev
+2  A: 

This is also a case of grouping adjacent elements:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()[1]|@*"/>
        </xsl:copy>
        <xsl:apply-templates select="following-sibling::node()[1]"/>
    </xsl:template>
    <xsl:template match="em">
        <em>
            <xsl:call-template name="merge"/>
        </em>
        <xsl:apply-templates
             select="following-sibling::node()[not(self::em)][1]"/>
    </xsl:template>
    <!-- Merge mode: any non-em node ends the run (the em rule below wins on priority for ems) -->
    <xsl:template match="node()" mode="merge"/>
    <xsl:template match="em" name="merge" mode="merge" >
        <xsl:apply-templates select="node()[1]"/>
        <xsl:apply-templates select="following-sibling::node()[1]" 
                             mode="merge"/>
    </xsl:template>
</xsl:stylesheet>

Output:

<p>Hey it's <em>italicitalic</em>!</p>

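Note: this uses a fine-grained traversal that processes the document node by node:

  1. The identity rule copies the current node, applies templates to its attributes and its first child, and then applies templates to its next sibling.

  2. The em rule always fires on the first em of a run (because the traversal goes node by node). It wraps the output in a single em, calls the merge template, and then applies templates to the next following sibling that is not an em.

  3. The em rule in merge mode (which is also the named template merge) applies templates to the em's first child (in this case just a text node, but this allows nested elements) and then to the next sibling in merge mode.

  4. The "break" rule matches any node in merge mode. Because a name test beats a node type test in priority, it only fires for nodes that are not em, and that stops the merging.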
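The same "first in group, absorb siblings until out of group, continue after the group" pattern can be sketched outside XSLT as well. The following is only an illustration in Python with the standard xml.etree.ElementTree module, not a translation of the stylesheet: the function name merge_adjacent is made up here, the default parser discards comments and processing instructions (so, unlike the stylesheet, they would not keep two ems apart), and attributes on the absorbed ems are dropped.

```python
import xml.etree.ElementTree as ET


def merge_adjacent(parent, tag="em"):
    """Merge runs of directly adjacent <tag> children of `parent`, in place."""
    i = 0
    while i < len(parent):
        node = parent[i]
        if node.tag == tag:
            # "Merge mode": absorb the next sibling while it is also <tag>
            # and no text (node.tail) sits between the two elements.
            while not node.tail and i + 1 < len(parent) and parent[i + 1].tag == tag:
                nxt = parent[i + 1]
                if nxt.text:
                    if len(node):
                        # Text goes after the last child already in the group.
                        node[-1].tail = (node[-1].tail or "") + nxt.text
                    else:
                        node.text = (node.text or "") + nxt.text
                node.extend(list(nxt))   # move any nested elements across
                node.tail = nxt.tail     # the merged em inherits the trailing text
                del parent[i + 1]
        merge_adjacent(node, tag)        # recurse, like the node-by-node traversal
        i += 1


root = ET.fromstring("<p>Hey it's <em>italic1</em><em>italic2</em><em>italic3</em>!</p>")
merge_adjacent(root)
print(ET.tostring(root, encoding="unicode"))
# prints: <p>Hey it's <em>italic1italic2italic3</em>!</p>
```

Any non-empty tail text, including a single space, ends the run, which matches the requirement that ems separated by anything are left alone.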

Alejandro
@Alejandro: This is a very short solution, but difficult to understand. I needed a debugger to see what was taking place. This is especially true of the combination of the last two templates.
Dimitre Novatchev
@Dimitre: Do you think so? It's the same pattern over and over again for grouping adjacent nodes. Copy everything, match the first in the group, go to open mode (process all siblings, stop when out of the group), process the next sibling not in the group.
Alejandro
@Alejandro: It would be good to explain this in your answer. Also, the fact that the last template overrides its previous template is not too obvious. Also, the name of the mode is confusing (at least to me). A better name would be "merge" or "merge-em". To make the processing very understandable, I would re-write the empty template in this way: `<xsl:template match="node()[not(self::em)]" mode="open"/>`
Dimitre Novatchev
@Dimitre: I like the suggested mode name.
Alejandro