tags:

views:

514

answers:

2

(Note: If you do well in a functional programming language other than XSLT, this problem might still be for you - it's not necessarily the XSLT that makes this tricky for me.)

How would I transform this:

<!-- note that empty cells are implicit and therefore nonexistent -->
<tables>
  <table pos="1">
    <row pos="1">
      <cell pos="1">Category A</cell>
      <cell pos="10">Category B</cell>
    </row>
    <row pos="2">
      <cell pos="1">Sub-Category 1</cell>
      <cell pos="4">Sub-Category 2</cell>
      <cell pos="6">Sub-Category 3</cell>
      <cell pos="10">Sub-Category 1</cell>
    </row>
  </table>
  <table pos="2">
    <row pos="1">
      <cell pos="1">Category A</cell>
      <cell pos="11">Category B</cell>
    </row>
    <row pos="2">
      <cell pos="1">Sub-Category 1</cell>
      <cell pos="2">Sub-Category 2</cell>
      <cell pos="4">Sub-Category 3</cell>
      <cell pos="10">Sub-Category 4</cell>
      <cell pos="11">Sub-Category 1</cell>
      <cell pos="12">Sub-Category 2</cell>
    </row>
  </table>
</tables>

To this:

<tree>
  <node label="Category A">
    <positions>
      <pos table="1" row="1" cell="1" />
      <pos table="2" row="1" cell="1" />
    </positions>
    <node label="Sub-Category 1">
      <positions>
        <pos table="1" row="2" cell="1" />
        <pos table="2" row="2" cell="1" />
      </positions>
    </node>
    <node label="Sub-Category 2">
      <positions>
        <pos table="1" row="2" cell="4" />
        <pos table="2" row="2" cell="2" />
      </positions>
    </node>
    <node label="Sub-Category 3">
      <positions>
        <pos table="1" row="2" cell="6" />
        <pos table="2" row="2" cell="4" />
      </positions>
    </node>
    <node label="Sub-Category 4">
      <positions>
        <pos table="2" row="2" cell="10" />
      </positions>
    </node>
  </node>
  <node label="Category B">
    <positions>
      <pos table="1" row="1" cell="10" />
      <pos table="2" row="1" cell="11" />
    </positions>
    <node label="Sub-Category 1">
      <positions>
        <pos table="1" row="2" cell="10" />
        <pos table="2" row="2" cell="11" />
      </positions>
    </node>
    <node label="Sub-Category 2">
      <positions>
        <pos table="2" row="2" cell="12" />
      </positions>
    </node>
  </node>
</tree>

Note that each table represents an arbitrarily deep 2D category tree. Each row represents the children of the previous row, where child categories are attached to parents by their respective positions - they are a child if their position is >= the parent to the left and < the next parent to the right (if there is one).

I'd like the output to be a tree, grouped by label (only within the described parent-child relationship, but through tables). For example, "Sub-Category 1" exists separately for both "Category A" and "Category B".

There are n tables with n rows with n cells each. You could imagine the input data structure as a 3D cube, each table representing roughly the same data for a different year.

I can do the above if it was just one table ("year"), but with n tables I have a big problem finding a way of applying the thing, due to the varying positions of the otherwise equal parents.

In the end I'm interested in an extension function free XSLT 1.0 solution, though any (algorithmic) help is greatly appreciated. This problem is haunting me for quite a while now and I'm seemingly unable to wrap my head around it. I feel there must be a clean solution for this, I just can't see it. I'm confident this can be done with a single recursive template, a few keys and some really clever XPath.

I suppose this question is material for the tumbleweed badge, though. :-)

+1  A: 

So the cells in each row are actually the children of the cells in the previous row where the "pos" value of the child cell is >= the "pos" value of the parent cell and < the "pos" value of the subsequent parent (if any) on the previous row?

If you're hell-bent on using XSL for this then take a look at the following-sibling axis. However, it's going to be one messy selection path. Probably easier to do two transforms: one that organizes the flat data into a hierarchical format, and one that produces the final output.

Personally, I think I'd write a program that explicitly processes the DOM tree, particularly if this is a one-off operation. I suspect that it would be just as fast to write, and not have the chance for strange corner cases in the selection logic.

kdgregory
As I said, XSLT itself is not really a problem for me. :-) I all else fails I'm bound to do it by external DOM manipulation. Which is going to be less elegant, but it's going to work. Thanks! (I guess a +1 on every answer to this question is mandatory for me, I think the problem is much above average compared to other XSLT problems here.)
Tomalak
+1  A: 

Here's my first attempt at this one. It looked easy enough, until I ran into trouble with grouping recursively. I thought about using a node-set, but I think that is an extension to XSL 1.0, right? So, no node-set for this? Lacking that, it goes something like this:

  1. Find all the root nodes, handle them differently (they've got no parents, and thus, different rules apply).

  2. For each such node, crawl through every single //cell node and test to see if it might be a direct child of the current node. If it might be a child, we still have to group it, so...

  3. Crawl through every //cell node along the preceding axis. For each such preceding //cell node, see if it might also be a direct child of the parent node found in step 2.

  4. Upon finding a preceding //cell node that is a child node, compare the label of the preceding child with the child found in step 2. If they are equal, don't output anything (since this label was output earlier). Otherwise...

  5. Begin outputting the child node. Crawl through every single //cell node yet again to locate all positions of this label in which it is a child node of the parent node found in step 2.

  6. Recurse: go back to step 2, using this child node as the new parent node.

Tested on MSXML2, looks like it is generating what you asked for. It appears to work as I understand that it should work even if I add some more <row> elements. It was certainly fun to write, and it got me thinking in directions that I don't normally think in while working with XSL. I guess that's good... ? Then again, maybe this is an example of a problem that CAN be solved with XSL, but could be solved more easily some other way.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;
    <xsl:output method="xml" indent="yes"/>

    <xsl:key name="roots" match="/tables/table/row[1]/cell" use="text()" />
    <xsl:key name="cell-by-label" match="/tables/table/row/cell" use="text()"/>
    <xsl:variable name="cells" select="/tables/table/row/cell" />

    <xsl:template match="/tables">

     <xsl:variable name="rootCells" select="/tables/table/row[1]/cell"/>

     <tree>
      <!-- Work on each root-level nodes first. -->
      <xsl:apply-templates
       mode="root"
       select="$rootCells[count(.|key('roots', text())[1]) = 1]" />
     </tree>
    </xsl:template>

    <!--
    Root level cells are handled differently. They have no available $parent,
    for one thing. Because of that, it seems simpler to handle them as an
    exception here, rather than peppering the mode="crawl" with exceptions
    for parentless nodes.
    -->
    <xsl:template match="cell" mode="root">
     <node label="{.}">

      <!--
      Get a list of everywhere that this cell is found.
      We are looking for only other root-level cells here.
      -->
      <positions>
       <xsl:for-each select="key('roots', text())">
        <pos table="{../../@pos}" row="{../@pos}" cell="{@pos}"/>
       </xsl:for-each>
      </positions>

      <!--
      Locate all child nodes, across all tables in which this node is found.
      A node is a child node if:
      1. It is in a row directly following the row in which this cell is found.
      2. If the @pos is >= the @pos of the parent
      3. If the @pos is < the @pos of the parent to the right (if there is a parent to the right)
         Note: Meeting the above conditions is not difficult; it's grouping at this
               point that gives trouble. If the problem permitted extension functions
               to XSL 1.0, then perhaps we could generate a node-set and group that.
               I've not tried that way, since I'm under the impression that a node-set
               would be an extension to 1.0, and therefore, "cheating."
               However, if we could generate a node-set and group based on that,
               then the following block selects the correct nodes (I think):

      <xsl:for-each select="$matches">
       <xsl:variable name="childRow" select="../following-sibling::row[1]"/>
       <xsl:variable name="Lparent" select="@pos"/>
       <xsl:variable name="Rparent" select="following-sibling::cell[1]/@pos"/>
       <xsl:choose>
       <xsl:when test="$Rparent">
        <xsl:apply-templates
         select="$childRow/cell[
              @pos &gt;= $Lparent
          and @pos &lt; $Rparent]"
         mode="child" />
       </xsl:when>
       <xsl:otherwise>
        <xsl:apply-templates
         select="$childRow/cell
          [@pos &gt;= $Lparent]"
         mode="child"/>
       </xsl:otherwise>
       </xsl:choose>
      </xsl:for-each>

      Without using a node-set, I'll try to solve this by crawling over every
      table/row/cell and test each one, every time.  Not pretty by a long shot.
      But hey, memory and processors are getting cheaper every day.
      -->

      <xsl:apply-templates select="$cells" mode="crawl">
      <xsl:with-param name="parent" select="."/>
      <xsl:with-param name="task">Group children by their labels</xsl:with-param>
      </xsl:apply-templates>

     </node>
    </xsl:template>

    <xsl:template match="cell" mode="child">
    <xsl:param name="parent"/>

     <node label="{.}">

      <positions>
       <xsl:apply-templates mode="crawl"
        select="key('cell-by-label', text())">
       <xsl:with-param name="parent" select="$parent"/>
       <xsl:with-param name="child"  select="."/>
       <xsl:with-param name="task">Find all positions of a child</xsl:with-param>
       </xsl:apply-templates>
      </positions>

      <!-- And.... Recursion Start Now! -->
      <xsl:apply-templates select="$cells" mode="crawl">
      <xsl:with-param name="parent" select="."/>
      <xsl:with-param name="task">Group children by their labels</xsl:with-param>
      </xsl:apply-templates>

     </node>

    </xsl:template>

    <xsl:template match="cell" mode="crawl">
    <xsl:param name="parent"/>
    <xsl:param name="child"/>
    <xsl:param name="task"/>

     <xsl:variable name="parentRow"
      select="generate-id(../preceding-sibling::row[1])"/>

     <xsl:variable name="parentCell"
      select="key('cell-by-label', $parent/text())
       [$parentRow = generate-id(..)]" />

     <xsl:variable name="RparentPos"
      select="$parentCell/following-sibling::cell[1]/@pos"/>

     <!--
     This cell is a child if it is in a row directly following a row
     in which the parent cell's text value made an appearance.
     <xsl:if test="$parentCell">

     This cell is a child if it's @pos is >= the parent cell's pos
     <xsl:if test="@pos &gt;= $parentCell/@pos">

     If there is a parent cell to the right of this cell's parent cell,
     this this cell is a child only if its @pos is < the right-parent
     cell's @pos.
     <xsl:if test="not($RparentPos) or @pos &lt; $RparentPos">
     -->

     <xsl:if test="
      $parentCell
      and (@pos &gt;= $parentCell/@pos)
      and (not($RparentPos) or @pos &lt; $RparentPos)">

      <xsl:choose>

      <!--
      If our task is to determine whether there are any nodes prior to
      the given child node in the document order which are also
      children of the parent and which have the same label value, we do
      that now. All we really want is to make a mark. We will later use
      string-length to see if we made any marks here or not.
      -->
      <xsl:when test="$task = 'Are there prior children with equal labels?'">
       Yes
      </xsl:when>

      <!--
      Here, our task is to generate the <pos> nodes of the children.
      -->
      <xsl:when test="$task = 'Find all positions of a child'">
       <pos table="{../../@pos}" row="{../@pos}" cell="{@pos}"/>
      </xsl:when>

      <!--
      If our task is to group children by their labels, we need to know
      if this is the first child node with this particular label. To do
      that, we crawl over all cells along the preceding axis, and if
      they are otherwise potential children (see above block), then a
      mark is made (perhaps several such marks, doesn't matter how many,
      really). If we have any marks when we are done, we know we have
      output this label before, so we don't do it again.
      -->
      <xsl:when test="$task = 'Group children by their labels'">
       <xsl:variable name="priorMatches">
        <xsl:apply-templates mode="crawl"
         select="preceding::cell[text() = current()/text()]">
        <xsl:with-param name="parent" select="$parent"/>
        <xsl:with-param name="task">Are there prior children with equal labels?</xsl:with-param>
        </xsl:apply-templates>
       </xsl:variable>

       <xsl:if test="string-length($priorMatches) = 0">
        <xsl:apply-templates select="." mode="child">
        <xsl:with-param name="parent" select="$parent"/>
        </xsl:apply-templates>
       </xsl:if>
      </xsl:when>

      </xsl:choose>
     </xsl:if>

    </xsl:template>
</xsl:stylesheet>

Edit 1: Reformatted slightly. Added the use of a few xsl:key elements to help find child nodes by their label. Made a few other optimizations--hopefully without reducing readability.

Thoughts / comments?

Chris Nielsen
I'm sorry for getting back to you that late. I'm in the process of working through your solution right now. :)
Tomalak
Any progress here? It was a neat problem, and if you'd like it a different way, I can take another swing at it, as my schedule allows!
Chris Nielsen