views:

482

answers:

2

I have an XML document based what Excel produces when saving as "XML Spreadsheet 2003 (*.xml)".

The spreadsheet itself contains a header section with a hierarchy of labels:

 | A     B     C     D     E     F     G     H     I
-+-----------------------------------------------------
1| a1                                  a2
2| a11         a12         a13         a21   a22
3| a111  a112  a121  a122  a131  a132        a221  a222

This hierarchy is present on all sheets in the workbook, and looks more or less the same everywhere.

Excel XML works exactly like ordinary HTML tables. (<row>s that contain <cell>s). I have been able to transform everything into such a tree structure:

<node title="a1" col="1">
  <node title="a11" col="1">
    <node title="a111" col="1"/>
    <node title="a112" col="2"/>
  </node>
  <node title="a12" col="3">
    <node title="a121" col="3" />
    <node title="a122" col="4" />
  </node>
  <!-- and so on -->
</node>

But here is the complication:

  • there is more than one worksheet, so there is a tree for each of them
  • the hierarchy may be slightly different on each sheet, the trees will not be equal (for example, sheet 2 may have "a113", while the others don't)
  • tree depth is not explicitly limited
  • the labels however are meant to be the same across all sheets, which means they can be used for grouping

I'd like to merge these separate trees into one that looks like this:

<node title="a1">
  <col on="sheet1">1</col>
  <col on="sheet2">1</col>
  <node title="a11">
    <col on="sheet1">1</col>
    <col on="sheet2">1</col>
    <node title="a111">
      <col on="sheet1">1</col>
      <col on="sheet2">1</col>
    </node>
    <node title="a112">
      <col on="sheet1">2</col>
      <col on="sheet2">2</col>
    </node>
    <node title="a113"><!-- different here -->
      <col on="sheet2">3</col>
    </node>
  </node>
  <node title="a12">
    <col on="sheet1">3</col>
    <col on="sheet2">4</col>
    <node title="a121">
      <col on="sheet1">3</col>
      <col on="sheet2">4</col>
    </node>
    <node title="a122">
      <col on="sheet1">4</col>
      <col on="sheet2">5</col>
    </node>
  </node>
  <!-- and so on -->
</node>

Ideally I'd like to be able to do the merge before I even build the three structure from the Excel XML (if you get me started on this, it'd be great). But since I have no idea how I would do this, a merge after the trees have been built (i.e.: the situation described above) will be fine.

Thanks for your time. :)

+1  A: 
Steve Jessop
+2  A: 

Here is one possible solution in XSLT 1.0:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

    <xsl:template match="/*">
      <t>
        <xsl:apply-templates
           select="node[@title='a1'][1]">
          <xsl:with-param name="pOther"
            select="node[@title='a1'][2]"/>
        </xsl:apply-templates>
      </t>
    </xsl:template>

    <xsl:template match="node">
      <xsl:param name="pOther"/>

      <node title="{@title}">
        <col on="sheet1">
          <xsl:value-of select="@col"/>
        </col>
       <xsl:choose>
         <xsl:when test="not($pOther)">
           <xsl:apply-templates mode="copy">
             <xsl:with-param name="pSheet" select="'sheet1'"/>
           </xsl:apply-templates>
         </xsl:when>
         <xsl:otherwise>
           <col on="sheet2">
             <xsl:value-of select="$pOther/@col"/>
           </col>
           <xsl:for-each select=
             "node[@title = $pOther/node/@title]">

             <xsl:apply-templates select=".">
               <xsl:with-param name="pOther" select=
                "$pOther/node[@title = current()/@title]"/>
             </xsl:apply-templates>
           </xsl:for-each>

           <xsl:apply-templates mode="copy" select=
             "node[not(@title = $pOther/node/@title)]">
             <xsl:with-param name="pSheet" select="'sheet1'"/>
           </xsl:apply-templates>

           <xsl:apply-templates mode="copy" select=
             "$pOther/node[not(@title = current()/node/@title)]">
             <xsl:with-param name="pSheet" select="'sheet2'"/>
           </xsl:apply-templates>
         </xsl:otherwise>
       </xsl:choose>
      </node>
    </xsl:template>

    <xsl:template match="node" mode="copy">
      <xsl:param name="pSheet"/>

      <node title="{@title}">
        <col on="{$pSheet}">
          <xsl:value-of select="@col"/>
        </col>

        <xsl:apply-templates select="node" mode="copy">
          <xsl:with-param name="pSheet" select="$pSheet"/>
        </xsl:apply-templates>
      </node>
    </xsl:template>
</xsl:stylesheet>

When the above transformation is applied on this XML document (the concatenation of the two XML documents under a common top node -- left as an exercise for the reader :) ):

<t>
    <node title="a1" col="1">
     <node title="a11" col="1">
      <node title="a111" col="1"/>
      <node title="a112" col="2"/>
     </node>
     <node title="a12" col="3">
      <node title="a121" col="3" />
      <node title="a122" col="4" />
     </node>
     <!-- and so on -->
    </node>
    <node title="a1" col="1">
     <node title="a11" col="1">
      <node title="a111" col="1"/>
      <node title="a112" col="2"/>
      <node title="a113" col="3"/>
     </node>
     <node title="a12" col="4">
      <node title="a121" col="4" />
      <node title="a122" col="5" />
     </node>
     <!-- and so on -->
    </node>
</t>

The wanted result is produced:

<t>
    <node title="a1">
     <col on="sheet1">1</col>
     <col on="sheet2">1</col>
     <node title="a11">
      <col on="sheet1">1</col>
      <col on="sheet2">1</col>
      <node title="a111">
       <col on="sheet1">1</col>
       <col on="sheet2">1</col>
      </node>
      <node title="a112">
       <col on="sheet1">2</col>
       <col on="sheet2">2</col>
      </node>
      <node title="a113">
       <col on="sheet2">3</col>
      </node>
     </node>
     <node title="a12">
      <col on="sheet1">3</col>
      <col on="sheet2">4</col>
      <node title="a121">
       <col on="sheet1">3</col>
       <col on="sheet2">4</col>
      </node>
      <node title="a122">
       <col on="sheet1">4</col>
       <col on="sheet2">5</col>
      </node>
     </node>
    </node>
</t>

Do note the following:

  1. We suppose that both top node elements have "a1" as the value of their title attribute. This can easily be generalized.

  2. The template matching node has a parameter named pOther, which is the corresponding element named node from the other document. This template is applied - to only if $pOther exists.

  3. When no corresponding element named node exists, another template, also matching node, but in mode copy is applied. This template has a parameter named pSheet, the value of which is the sheet name (string) this element belongs to.

Dimitre Novatchev
@Dimitre: I am a little short on time today and will come back to your solution as soon as possible. Do not get impatient. :)
Tomalak