I am a support engineer and our company's product allows XSLT transforms to customize outputs.

I made an XSL transform for this purpose. It works well for source files of typical size (several hundred KB), but occasionally a really huge (10 MB) source file comes by. In such cases, the output is never generated, even if I let it grind for several days.

The SW engineering team tested it and discovered that the transform with the large source file in question is indeed very slow (> days) if our product is compiled to use the transform engine in .NET 1.1, but if they compile it with .NET 2.0, it is plenty fast (about 1-2 minutes).

The long-term solution, obviously, is to wait for the next release.

For the short term I am wondering the following: 1) Is XSLT flexible enough that there are more efficient and less efficient ways to achieve the same result? For example, is it possible that the way I structured the XSL, the transform engine has to iterate from the beginning of the source file many, many times, taking longer and longer as the next result piece gets farther and farther from the beginning (Schlemiel the Painter)? Or 2) Is it more dependent on how the transform engine interprets the XSL?

If 2 is the case, I don't want to waste a lot of time trying to improve the XSL (I am not a big XSL genius; it was hard enough for me to achieve what little I did...).

Thanks!

+1  A: 

It is definitely possible to write inefficient XSLT. However, if the discrepancy is truly 100 KB files vs. 10 MB files, I'm not sure what improvements you would see from making small modifications to your transform, as that is one very big file.

It might be worth posting some example code (if possible) to help describe what performance gains might be achieved.

Scott Saad
Scott, I put a very crude sample in my latest post. Thanks!
KnomDeGuerre
+4  A: 

I'm not familiar with the .NET implementations, but there are a few things you can do in general to speed up processing of large documents:

  • Avoid using "//" in XPath expressions unless absolutely necessary.
  • If you only need the first or only element that matches an XPath expression, use the "[1]" qualifier, e.g. "//iframe[1]". Many processors implement optimizations for this. (A small sketch of these first two points follows the list.)
  • Whenever possible, when dealing with huge XML input, see if you can design a solution around a stream-based parser (like SAX) instead of a DOM-based parser.
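
For illustration, a minimal sketch of the first two points; the element names here are invented and not taken from the question:

<!-- Potentially slow: "//" forces the processor to scan the entire document -->
<xsl:value-of select="//Section/Title"/>

<!-- Usually cheaper: an explicit path from the root, taking only the first matching Section -->
<xsl:value-of select="/Report/Section[1]/Title"/>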
Marco
+3  A: 

Normally, if you see a non-linear increase in processing time vs. input size, you should suspect your code more than the framework. But since the problem goes away when the tool is compiled with .NET 2.0, all bets are off.

With XSLT, it's hard to create a non-linear performance curve if you do all your parsing with straight template matches:

<xsl:template match="foo">
  <!--OUTPUT-->
  <xsl:apply-templates / >
  <!--OUTPUT-->
</xsl:template>

 <xsl:template match="bar">
  <!--OUTPUT-->
  <xsl:apply-templates / >
  <!--OUTPUT-->
</xsl:template>

Pay careful attention to anywhere you might have resorted to <xsl:for-each> for parsing; template matches are virtually always a better way to achieve the same result.
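
For example, a hypothetical for-each block and a roughly equivalent template-match rewrite (the element names here are invented, not taken from the original transform):

<!-- Hypothetical for-each version: the iteration is spelled out in the loop -->
<xsl:template match="DataSet">
  <xsl:for-each select="Data/Row">
    <!--OUTPUT for each Row-->
  </xsl:for-each>
</xsl:template>

<!-- Alternative template-match version (use one or the other, not both):
     the processor drives the iteration via apply-templates -->
<xsl:template match="DataSet">
  <xsl:apply-templates select="Data/Row"/>
</xsl:template>

<xsl:template match="Row">
  <!--OUTPUT for each Row-->
</xsl:template>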

One way to troubleshoot this performance problem is to recreate your XSLT one template-match at a time, testing the processing time after adding each match. You might start with this match:

<xsl:template match="*">
  <xsl:copy>                   <!--Copy node                   -->
    <xsl:copy-of select="@*"/> <!--Copy node attributes         -->
    <xsl:apply-templates />    <!--Process children             -->
  </xsl:copy>
</xsl:template>

This will match and copy every node, one at a time, to a new document. This should not exhibit a non-linear increase in processing time vs. input size (if it does, then the problem is not with your XSLT code).

As you recreate your XSLT, if you add a template-match that suddenly kills performance, comment out every block inside the template. Then, uncomment one block at a time, testing the processing time each iteration, until you find the block that causes the problem.

trebormf
+1  A: 

To Marco - one of your cautions does apply, and one of your recommendations can be implemented to resolve the caution:

One template in my transform does indeed have "//". The top-level node will only ever occur once, in any of the expected source files, so I changed it to use the [1] explicit index. For example, I changed:

<xsl:for-each select="//DataSet/Data/SomethingElse">

to

<xsl:for-each select="DataSet[1]/Data/SomethingElse">

(There can be multiple SomethingElse's in the source files, but only ever one DataSet per file.)

A quick check shows that this is not the major cause of the slowdown; I started the transform about five minutes ago and it is still running. I'll let it go overnight and see what happens. Thanks!

KnomDeGuerre
As I mentioned in my post, try to avoid `<xsl:for-each>`. Borrowing from your example, it would be better to see `<xsl:template match="SomethingElse">`. I'll see if I can find a link to explain why.
trebormf
+1  A: 

One thing worth checking is whether your XSLT does a lot of lookups into other parts of the XML document, i.e. you are at one context node and look up a value in another part of the document, or even in another document. Doing this can hit performance quite hard, and you should consider using xsl:key and the key() function for it. It tells the processor to build a fast lookup index over the data in question.

I was once building an XSLT that took 8 hours to run (with lots of cross references) and switching to use keys gave it a massive speed boost.
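
For illustration, a rough sketch of the difference (the element names, attributes, and key name below are invented, not from the question):

<!-- Without a key: every lookup rescans the whole document -->
<xsl:value-of select="//Customer[@id = current()/@customerRef]/Name"/>

<!-- With a key: declare the index once, at the top level of the stylesheet... -->
<xsl:key name="customerById" match="Customer" use="@id"/>

<!-- ...and each lookup becomes an indexed access -->
<xsl:value-of select="key('customerById', @customerRef)/Name"/>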

Matt Large
Thanks! I don't have any xsl:key in my xsl.
KnomDeGuerre
+1  A: 

Upon looking up your problem, I found a KB article at Microsoft about it. You can see it here.

They say that the XSLT transform in .NET 1 has some performance issues, and they offer a quick fix.

If you want to try to troubleshoot the problem, there is an XSLT profiler available here.

Otherwise, you can follow the links given on the Microsoft website for optimizing speed issues with XSLT (link).

Maxim
Thanks! No xsl:key in my transform though.
KnomDeGuerre
A: 

An update of activities I have done since:

To Treb - those two sample pieces of code were already inside a template that, for the huge source file in question, only gets called once.

Meanwhile, since everything I need from inside the source file is under the //DataSet/Data section (of which there is only one), I wrapped everything inside a

<xsl:template match="DataSet/Data">

wrapper. The template mentioned earlier gets called by an 'if' check in the main section. I couldn't figure out how to get the called template to work with this new wrapper, so I got rid of it entirely and incorporated its fairly simple guts into the 'if' section directly.

The result is still extremely slow, so I am tempted to accept the earlier answer from Treb because of the "all bets are off" part.

KnomDeGuerre
I've added an additional troubleshooting suggestion to my original answer in hopes that it will help you. Please take a look.
trebormf
A: 

To anyone who would like more details to work with ... I can't post exact samples for proprietary reasons, but here is a more detailed description of what I am trying to accomplish:

The source file is like a flat table of values. I need to transform it into a sectioned or outlined table, where each section begins when one specific column value in the flat table changes. For example, suppose the flat table contents are:

1 0 0 0 0 
1 0 1 0 0 
1 2 0 0 0 
1 1 1 0 0 
2 0 0 0 0 
2 0 1 0 0 
2 2 0 0 0 
2 1 1 0 0

The output section headers should be floated up from the first column of the flat table. So the transformed output would be like this:

1
    0 0 0 0 
    0 1 0 0 
    2 0 0 0 
    1 1 0 0 
2
    0 0 0 0 
    0 1 0 0 
    2 0 0 0 
    1 1 0 0

etc

To detect when to start a new section, I did this:

<xsl:if test="@TheFirstCol>preceding-sibling::*[1]/@TheFirstCol">

Could this be causing a lot of re-iteration? If so, is there a less iterative way to compare the value of an attribute in the current node against the value from the previous node?

Thanks!

KnomDeGuerre
The <xsl:if> test above looks fine. It's quite normal to "look back" at previous values. Sorry for my slow response (I'm traveling).
trebormf
+3  A: 

> To detect when to start a new section, I did this:
>
> <xsl:if test="@TheFirstCol>preceding-sibling::*[1]/@TheFirstCol">
>
> Could this be causing a lot of re-iteration?

Definitely. The algorithm you've chosen is O(N^2) and would be very slow with a sufficient number of siblings, regardless of the implementation language.

Here is an efficient algorithm using keys:

Solution 1:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>

 <xsl:key name="kC1Value" match="@c1" use="."/>

 <xsl:template match="/">
   <xsl:for-each select=
   "*/x[generate-id(@c1) = generate-id(key('kC1Value',@c1)[1])]">

    <xsl:value-of select="concat('&#xA;',@c1)"/>

    <xsl:for-each select="key('kC1Value',@c1)">
      <xsl:value-of select="'&#xA;'"/>
      <xsl:for-each select="../@*[not(name()='c1')]">
        <xsl:value-of select="concat('   ', .)"/>
      </xsl:for-each>
    </xsl:for-each>
   </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>

Unfortunately, XslTransform (.NET 1.1) has a notoriously inefficient implementation of the generate-id() function.

The following may be faster with XslTransform:

Solution 2:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>

 <xsl:key name="kC1Value" match="@c1" use="."/>

 <xsl:template match="/">
   <xsl:for-each select=
   "*/x[count(@c1 | key('kC1Value',@c1)[1]) = 1]">

    <xsl:value-of select="concat('&#xA;',@c1)"/>

    <xsl:for-each select="key('kC1Value',@c1)">
      <xsl:value-of select="'&#xA;'"/>
      <xsl:for-each select="../@*[not(name()='c1')]">
        <xsl:value-of select="concat('   ', .)"/>
      </xsl:for-each>
    </xsl:for-each>
   </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>

When applied to the following small XML document:

<t>
 <x c1="1" c2="0" c3="0" c4="0" c5="0"/>
 <x c1="1" c2="0" c3="1" c4="0" c5="0"/>
 <x c1="1" c2="2" c3="0" c4="0" c5="0"/>
 <x c1="1" c2="1" c3="1" c4="0" c5="0"/>
 <x c1="2" c2="0" c3="0" c4="0" c5="0"/>
 <x c1="2" c2="0" c3="1" c4="0" c5="0"/>
 <x c1="2" c2="2" c3="0" c4="0" c5="0"/>
 <x c1="2" c2="1" c3="1" c4="0" c5="0"/>
 <x c1="3" c2="0" c3="0" c4="0" c5="0"/>
 <x c1="3" c2="0" c3="1" c4="0" c5="0"/>
 <x c1="3" c2="2" c3="0" c4="0" c5="0"/>
 <x c1="3" c2="1" c3="1" c4="0" c5="0"/>
 <x c1="3" c2="0" c3="0" c4="0" c5="0"/>
 <x c1="3" c2="0" c3="1" c4="0" c5="0"/>
 <x c1="3" c2="2" c3="0" c4="0" c5="0"/>
 <x c1="3" c2="1" c3="1" c4="0" c5="0"/>
 <x c1="4" c2="0" c3="0" c4="0" c5="0"/>
 <x c1="4" c2="0" c3="1" c4="0" c5="0"/>
 <x c1="4" c2="2" c3="0" c4="0" c5="0"/>
 <x c1="4" c2="1" c3="1" c4="0" c5="0"/>
 <x c1="5" c2="0" c3="0" c4="0" c5="0"/>
 <x c1="5" c2="0" c3="1" c4="0" c5="0"/>
 <x c1="5" c2="2" c3="0" c4="0" c5="0"/>
 <x c1="5" c2="1" c3="1" c4="0" c5="0"/>
 <x c1="5" c2="0" c3="0" c4="0" c5="0"/>
 <x c1="5" c2="0" c3="1" c4="0" c5="0"/>
 <x c1="6" c2="2" c3="0" c4="0" c5="0"/>
 <x c1="6" c2="1" c3="1" c4="0" c5="0"/>
 <x c1="6" c2="0" c3="0" c4="0" c5="0"/>
 <x c1="6" c2="0" c3="1" c4="0" c5="0"/>
 <x c1="6" c2="2" c3="0" c4="0" c5="0"/>
 <x c1="6" c2="1" c3="1" c4="0" c5="0"/>
 <x c1="7" c2="0" c3="0" c4="0" c5="0"/>
 <x c1="7" c2="0" c3="1" c4="0" c5="0"/>
 <x c1="7" c2="2" c3="0" c4="0" c5="0"/>
 <x c1="7" c2="1" c3="1" c4="0" c5="0"/>
 <x c1="8" c2="0" c3="0" c4="0" c5="0"/>
 <x c1="8" c2="0" c3="1" c4="0" c5="0"/>
 <x c1="8" c2="2" c3="0" c4="0" c5="0"/>
 <x c1="8" c2="1" c3="1" c4="0" c5="0"/>
</t>

both solutions produced the wanted result:

1
  0 0 0 0
  0 1 0 0
  2 0 0 0
  1 1 0 0
2
  0 0 0 0
  0 1 0 0
  2 0 0 0
  1 1 0 0
3
  0 0 0 0
  0 1 0 0
  2 0 0 0
  1 1 0 0
  0 0 0 0
  0 1 0 0
  2 0 0 0
  1 1 0 0
4
  0 0 0 0
  0 1 0 0
  2 0 0 0
  1 1 0 0
5
  0 0 0 0
  0 1 0 0
  2 0 0 0
  1 1 0 0
  0 0 0 0
  0 1 0 0
6
  2 0 0 0
  1 1 0 0
  0 0 0 0
  0 1 0 0
  2 0 0 0
  1 1 0 0
7
  0 0 0 0
  0 1 0 0
  2 0 0 0
  1 1 0 0
8
  0 0 0 0
  0 1 0 0
  2 0 0 0
  1 1 0 0

From the above small XML file I generated a 10MB XML file by copying every element 6250 times (using another XSLT transformation :) ).
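
For the curious, here is a minimal sketch of how such a blow-up transform could be written; this is only an assumption for illustration (not necessarily the transform actually used), and the element and parameter names mirror the sample document above:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="xml" indent="yes"/>

 <!-- Number of copies of each <x> element to emit (assumed parameter name) -->
 <xsl:param name="copies" select="6250"/>

 <xsl:template match="/t">
  <t>
   <xsl:for-each select="x">
    <xsl:call-template name="repeat">
      <xsl:with-param name="n" select="$copies"/>
    </xsl:call-template>
   </xsl:for-each>
  </t>
 </xsl:template>

 <!-- Emit $n copies of the current element; halving the count keeps the
      recursion depth logarithmic instead of 6250 levels deep -->
 <xsl:template name="repeat">
  <xsl:param name="n"/>
  <xsl:choose>
   <xsl:when test="$n &lt;= 0"/>
   <xsl:when test="$n = 1">
    <xsl:copy-of select="."/>
   </xsl:when>
   <xsl:otherwise>
    <xsl:call-template name="repeat">
      <xsl:with-param name="n" select="floor($n div 2)"/>
    </xsl:call-template>
    <xsl:call-template name="repeat">
      <xsl:with-param name="n" select="$n - floor($n div 2)"/>
    </xsl:call-template>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>
</xsl:stylesheet>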

With the 10MB XML file and XslCompiledTransform (.NET 2.0+), the two solutions had the following transformation times:

Solution 1: 3.3 sec.

Solution 2: 2.8 sec.

With XslTransform (.NET 1.1), Solution 2 ran for 1622 sec., that is, about 27 minutes.

Hope this helped.

Cheers,

Dimitre Novatchev
Yes, very helpful. Thank you, Dimitre!
KnomDeGuerre
I just saw that the <xsl:stylesheet> and <xsl:key> instructions were not being displayed -- this is now corrected.
Dimitre Novatchev