I am using the XSLT 2.0 code below to find the ids of the text nodes that contain the character indices I give as input. The code works correctly, but it takes a long time on huge files. Even for huge files, if the index values are small the result comes back within a few ms. I am using the Saxon 9 HE (saxon9he) Java processor to execute the XSL.

<xsl:variable name="insert-data" as="element(data)*"> 
  <xsl:for-each-group 
    select="doc($insert-file)/insert-data/data" 
    group-by="xsd:integer(@index)"> 
    <xsl:sort select="current-grouping-key()"/> 
    <data 
      index="{current-grouping-key()}" 
      text-id="{generate-id(
        $main-root/descendant::text()[
          sum((preceding::text(), .)/string-length(.)) ge current-grouping-key()
        ][1]
      )}"> 
      <xsl:copy-of select="current-group()/node()"/> 
    </data> 
  </xsl:for-each-group> 
</xsl:variable> 

With the above solution, if the index value is large, say 270962, the XSL takes 83427 ms to execute. In huge files, if the index values are very large, say 4605415 or 4605431, it takes several minutes. The computation of the variable "insert-data" seems to be what takes the time, even though it is a global variable and computed only once. Should the XSL be addressed, or the processor? How can I improve the performance of the XSL?

+1  A: 

I'd guess the problem is the generation of text-id, i.e. the expression

generate-id(
    $main-root/descendant::text()[
      sum((preceding::text(), .)/string-length(.)) ge current-grouping-key()
    ][1]
  )

You are potentially recalculating a lot of sums here: for every candidate text node, the predicate sums the lengths of all preceding text nodes again, so the work grows with how far into the document the index lies. I think the easiest path would be to invert your approach: recurse across the text nodes in the document, accumulate the string length so far, and output a data element each time a new @index is reached. The following example illustrates the approach. Note that each unique @index and each text node is visited only once.

<xsl:variable name="insert-doc" select="doc($insert-file)"/>

<xsl:variable name="insert-data" as="element(data)*"> 
    <xsl:call-template name="calculate-data"/>
</xsl:variable>

<xsl:key name="index" match="data" use="xsd:integer(@index)"/>

<xsl:template name="calculate-data">
    <xsl:param name="text-nodes" select="$main-root//text()"/>
    <xsl:param name="previous-lengths" select="0"/>
    <xsl:param name="indexes" as="xsd:integer*">
        <xsl:perform-sort 
            select="distinct-values(
                    $insert-doc/insert-data/data/@index/xsd:integer(.))">
            <xsl:sort/>
        </xsl:perform-sort>
    </xsl:param>
    <xsl:if test="$text-nodes">
        <xsl:variable name="total-lengths" 
            select="$previous-lengths + string-length($text-nodes[1])"/>
        <xsl:choose>
            <xsl:when test="$total-lengths ge number($indexes[1])">
                <data 
                    index="{$indexes[1]}" 
                    text-id="{generate-id($text-nodes[1])}">
                    <xsl:copy-of select="key('index', $indexes[1], 
                                             $insert-doc)"/> 
                </data>
                <!-- Recursively move to the next index. -->
                <xsl:call-template name="calculate-data">
                    <xsl:with-param
                        name="text-nodes"
                        select="$text-nodes"/>
                    <xsl:with-param
                        name="previous-lengths" 
                        select="$previous-lengths"/>
                    <xsl:with-param
                        name="indexes" 
                        select="subsequence($indexes, 2)"/>
                </xsl:call-template>                    
            </xsl:when>
            <xsl:otherwise>
                <!-- Recursively move to the next text node. -->
                <xsl:call-template name="calculate-data">
                    <xsl:with-param 
                        name="text-nodes" 
                        select="subsequence($text-nodes, 2)"/>
                    <xsl:with-param
                        name="previous-lengths" 
                        select="$total-lengths"/>
                    <xsl:with-param 
                        name="indexes" 
                        select="$indexes"/>
                </xsl:call-template>                    
            </xsl:otherwise>
        </xsl:choose>
    </xsl:if>
</xsl:template>
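
To make the walk concrete, here is a small hand-traced example. Both input documents are made up for illustration (they are not from the question); only the `insert-data/data/@index` shape is taken from the code above.

```xml
<!-- Hypothetical main document ($main-root): two text nodes, lengths 5 and 6 -->
<doc><p>Hello</p><p>World!</p></doc>

<!-- Hypothetical insert file ($insert-file): index 3 appears twice -->
<insert-data>
  <data index="3"><a/></data>
  <data index="3"><b/></data>
  <data index="8"><c/></data>
</insert-data>

<!--
  Trace of calculate-data, with $indexes = (3, 8) after distinct-values and sorting:
  1. text "Hello":  total-lengths = 0 + 5 = 5;  5 ge 3 is true   -> emit data for index 3, keep the same text node
  2. text "Hello":  total-lengths = 0 + 5 = 5;  5 ge 8 is false  -> move on with previous-lengths = 5
  3. text "World!": total-lengths = 5 + 6 = 11; 11 ge 8 is true  -> emit data for index 8
  4. $indexes is now empty, number(()) is NaN, so the test is false and the
     recursion simply runs out of text nodes.
-->
```

One difference from the question's version is worth knowing: `key('index', $indexes[1], $insert-doc)` returns the matching `data` elements themselves, so each emitted `data` wraps copies of the original `data` elements, whereas the question copied `current-group()/node()`, i.e. only their children. Appending `/node()` to the `key()` call should reproduce the original output shape, if that is what you need.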
markusk
Thanks for your response. Will try this and update soon.
Rachel
Great, the response now comes back in a few ms. It went down from 82242 to 813. Thanks a lot!! However, the value of the "data" node alone is missing from the result. The line in question is `<xsl:copy-of select="key('index', $indexes[1], $insert-doc)"/>`, which reads from `<xsl:key name="index" match="data" use="@index"/>`.
Rachel
@Rachel: Good to hear that your response time improved. There was a bug in my original key definition; you need to use `<xsl:key name="index" match="data" use="xsd:integer(@index)"/>` to get correct results.
markusk
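For context on why the cast matters: as far as I understand the XSLT 2.0 rules, key values are compared without implicit conversion, and an untyped `@index` attribute is treated as a string, so looking it up with an `xsd:integer` value finds nothing. A minimal sketch of two ways to make the types agree (the key name `index-as-string` is made up for illustration; the `xsd` prefix is assumed to be bound to the XML Schema namespace, as elsewhere in the stylesheet):

```xml
<!-- Option 1 (used in the answer): index by integer, look up by integer -->
<xsl:key name="index" match="data" use="xsd:integer(@index)"/>
<!-- lookup: key('index', $indexes[1], $insert-doc) -->

<!-- Option 2 (hypothetical alternative): index by the raw attribute, look up by string -->
<xsl:key name="index-as-string" match="data" use="@index"/>
<!-- lookup: key('index-as-string', string($indexes[1]), $insert-doc) -->
```

Option 2 only matches when the attribute's text is exactly the canonical form of the number, so `index="007"` would not be found under `"7"`; the integer key in option 1 avoids that.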
It works now. Thanks.
Rachel
In my case the index value ($insert-doc/insert-data/data/@index) may not be unique. In that case multiple data tags are created for the same index. How can this be resolved? Please give your inputs.
Rachel
@Rachel: By using `distinct-values`. I edited my answer to use this function, see the new definition for the parameter `indexes`. Actually, I made this update to my answer almost an hour ago, but I guess you got the old version. :-)
markusk
Oh yes, I did not notice.
Rachel
@Rachel: No problem. I hope `distinct-values` solved the issue?
markusk
Your solution works perfectly with distinct-values. I slightly modified the XSL to incorporate an additional condition that I need. Adding it reduced performance, roughly doubling the response time. How can this be rewritten efficiently? I have edited my question and added sample code with the new condition included.
Rachel
@Rachel: If you want to test whether the current index node contains an element named "end", you can write `<xsl:when test="$indexes[1]/end">` instead of `<xsl:when test="contains($indexes[1]/node()/name() , 'end')">`. Does that improve your performance?
markusk
The condition is not satisfied. The element has a namespace prefix: pre:end. Is that the reason?
Rachel
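A likely explanation, sketched under assumptions: in XPath an unprefixed name test such as `end` matches only elements in no namespace (unless `xpath-default-namespace` is set), so it would not match `pre:end`. The namespace URI below is a placeholder; it has to be whatever URI the `pre` prefix is bound to in the insert file. The template name `has-end` and its `item` parameter are invented for illustration and stand in for the node that `$indexes[1]` holds in the modified stylesheet, which is not shown in this thread.

```xml
<!-- Sketch only: the namespace URI is a placeholder, and the template name
     "has-end" and parameter "item" are made up for illustration. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:pre="http://example.com/pre">

  <xsl:template name="has-end">
    <xsl:param name="item" as="element()"/>
    <xsl:choose>
      <!-- Prefixed test: child named "end" in the namespace bound to pre -->
      <xsl:when test="$item/pre:end">end found</xsl:when>
      <!-- Unprefixed test: only matches an "end" child in no namespace -->
      <xsl:when test="$item/end">end found (no namespace)</xsl:when>
      <xsl:otherwise>no end element</xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

The prefix used in the stylesheet does not have to be `pre`; only the URI must match the one in the source document. If the URI is not known, XPath 2.0 also accepts the namespace wildcard `$item/*:end`, which matches an `end` child in any namespace.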