ansaurus

Question

How can I speed up my 'divide and conquer' XSLT template which replaces certain characters in a string?

Answer 1

+5 A:

A very small correction improved the speed in my tests about 17 times.

There are additional improvements, but I guess this will suffice for now ... :)

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:my="my:my">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vChars">"@|#\}&amp;^~/{</xsl:variable>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="text()" name="escape-text">
  <xsl:param name="s" select="."/>
  <xsl:param name="len" select="string-length($s)"/>

  <xsl:choose>
    <xsl:when test="$len >= 2">
        <xsl:variable name="halflen" select="round($len div 2)"/>
        <xsl:variable name="left">
            <xsl:call-template name="escape-text">
                <xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
                <xsl:with-param name="len" select="$halflen"/>
            </xsl:call-template>
        </xsl:variable>
        <xsl:variable name="right">
            <xsl:call-template name="escape-text">
                <xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
                <xsl:with-param name="len" select="$halflen"/>
            </xsl:call-template>
        </xsl:variable>
        <xsl:value-of select="concat($left, $right)"/>
    </xsl:when>
    <xsl:otherwise>
        <xsl:choose>
            <xsl:when test="not(contains($vChars, $s))">
             <xsl:value-of select="$s"/>
            </xsl:when>
            <xsl:when test="$s = '&quot;'">
                <xsl:text>&quot;\&quot;&quot;</xsl:text>
            </xsl:when>
            <xsl:when test="$s = '@'">
                <xsl:text>&quot;@&quot;</xsl:text>
            </xsl:when>
            <xsl:when test="$s = '|'">
                <xsl:text>&quot;|&quot;</xsl:text>
            </xsl:when>
            <xsl:when test="$s = '#'">
                <xsl:text>&quot;#&quot;</xsl:text>
            </xsl:when>
            <xsl:when test="$s = '\'">
                <xsl:text>&quot;\\&quot;</xsl:text>
            </xsl:when>
            <xsl:when test="$s = '}'">
                <xsl:text>&quot;}&quot;</xsl:text>
            </xsl:when>
            <xsl:when test="$s = '&amp;'">
                <xsl:text>&quot;&amp;&quot;</xsl:text>
            </xsl:when>
            <xsl:when test="$s = '^'">
                <xsl:text>&quot;^&quot;</xsl:text>
            </xsl:when>
            <xsl:when test="$s = '~'">
                <xsl:text>&quot;~&quot;</xsl:text>
            </xsl:when>
            <xsl:when test="$s = '/'">
                <xsl:text>&quot;/&quot;</xsl:text>
            </xsl:when>
            <xsl:when test="$s = '{'">
                <xsl:text>&quot;{&quot;</xsl:text>
            </xsl:when>
        </xsl:choose>
    </xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>

Dimitre Novatchev 2010-08-24 16:59:42

you da man on this

Mark Schultheiss 2010-08-24 18:01:40

+1 Trivial but reasonable change. Things can be easy to improve, at times. :-)

Tomalak 2010-08-24 18:13:26

@Dimitre: +1 for improvement.

Alejandro 2010-08-24 18:35:40

+1: Thank you for your response. I'll play around a bit with this, trying to find out which of the changes causes the speedup. The use of `$vChars`? The extra `match="text()"` attribute to the template element of the `escape-text` template? Or maybe the strange little template which matches `node()|@*` - I first have to figure out what that actually does. :-}

Frerich Raabe 2010-08-24 19:35:51

@Frerich Raabe: The identity transform and matching text nodes are there to make this stylesheet works with any input (you've not provided one). The improvement is in discarding firstly characters not match those wich needs to be replaced. That's the meaning of the first `when/@test="not(contains($vChars, $s))"`

Alejandro 2010-08-24 20:04:45

@Frerich: It's the `when test="not(contains(...))"` (i.e. your `otherwise` case) moved to the first position within the `choose`. The most common condition must be checked first, to avoid redundant checks that almost always fail.

Tomalak 2010-08-24 20:05:15

@Frerich-Raabe: @Tomalak and @Alejandro have explained "the big change" to you. The effect can be maximum if you know the frequency of the characters that you are testing for and order your tests in decreasing character frequency.

Dimitre Novatchev 2010-08-24 20:17:02

@Frerich-Raabe: "the strange little template which matches `node()|@*`" represents the most fundamental XSLT design pattern -- bing for `XSLT identity rule` -- or just read this: http://dpawson.co.uk/xsl/sect2/identity.html

Dimitre Novatchev 2010-08-24 20:19:43

@Tomalak: Ah! I thought of `<xsl:choose>` is something like a `switch` in C/C++. Of course it's actually a long `if/else if` chain. D'oh!

Frerich Raabe 2010-08-24 20:22:59

@Frerich: You would notice the same effect in a C/C++ `switch`, because this is nothing more than syntactic sugar for an if/else chain. How would the running code know which of the conditions to check first? They are always checked in the order of their declaration. Apart from that I agree with @Dimitre, you should read about and understand the identity template.

Tomalak 2010-08-24 20:35:21

@Tomalak: A `switch` statement is much more than just syntactic sugar for `if/else`. In a `switch` statement, you can only test integral values, but this makes it possible for the compiler to compute a jump table, so each code path can (possibly) be reached in constant time. Since <xsl:when> allows arbitrary expressions, this is not possible.

Frerich Raabe 2010-08-24 20:49:30

@Frerich: Yes, that's true, but apart from this internal optimization the `case` parts are still tested for in order of declaration. Putting the most likely case last will result in a performance penalty, in C the same way as in XSLT.

Tomalak 2010-08-24 21:00:48

@Tomalak: [`switch` optimization]. AFAIK, the different values in the `case` statements are sorted and the tests aren't done sequentially, but with binary search (O(log2(N)).

Dimitre Novatchev 2010-08-24 21:19:25

@Tomalak, @Dimitre: Let's not get dragged into a C/C++ discussion but just agree that my thinking was wrong. :-)

Frerich Raabe 2010-08-24 21:22:37

@Frerich-Raabe: If one's thinking is never wrong, this person will not develop at all -- he will be dead. :)

Dimitre Novatchev 2010-08-24 21:43:13

Answer 2

+1 A:

How about using EXSLT? The String functions in EXSLT have a function called replace. I think it is something that is supported by quite a few XSLT implementations.

Wilfred Springer 2010-08-24 17:56:41

That's true; unfortunately, replace is not an option in my case since one of the characters to be replaced (`"`) is used as the escape character itself.

Frerich Raabe 2010-08-24 19:33:59

@Frerich, can you elaborate on that? Why would that prevent you from using replace()? However, that point is pretty well moot, because the only implementations of EXSLT's replace function listed are in XSLT, thus not in principle more efficient than doing it in XSLT yourself.

LarsH 2010-08-25 17:06:31

@LarsH: Sorry, my comment was imprecise. Here's an example: consider the string `"\ ` and keep in mind that `"` should be replaced with `"\""` and `\` should be replaced with `"\\"`; no matter which order you execute these two replacements, the result is incorrect since the second replacement will also replace characters which were introduced by the first replacement. [I give up, I can't get the markup right with these bloody backslashes!]

Frerich Raabe 2010-08-25 18:28:36

@Frerich, reading the spec on that function, it sounds like the replacements are supposed to be independent of each other, not operating on the output of each other. "The str:replace function works by replacing each occurence [sic] of a string in the search string list _within the first argument string_ [emphasis mine] by the equivalently positioned node in the replacement node list." I.e. it replaces occurrences in the original string, not occurrences in an intermediate result after applying some replacements. Too much detail here to put in a comment... I'll add an answer.

LarsH 2010-08-25 18:32:52

@Frerich, Never mind, I won't add an answer... I was going to say that exsl replace() takes a nodeset of search strings and a nodeset of replacement strings, so in theory, you can make multiple replacements in one call, and they all search the original string, not each other's output. However there is no known implementation _at all_ of the most recent version of the spec for that function, and while there is an XSLT implementation of an earlier version, I don't find the spec for it. So I'm not going to bother testing it.

LarsH 2010-08-25 18:47:49

@Frerich, actually I looked at the implementation and any given substring can only be replaced once. So it should work fine for your use. If you're interested, I'll post an answer showing the code.

LarsH 2010-08-25 19:35:24

Answer 3

+4 A:

Here is a more improved version, based on @Dimitre's answer:

  <xsl:template match="text()" name="escape-text">
    <xsl:param name="s" select="."/>
    <xsl:param name="len" select="string-length($s)"/>

    <xsl:choose>
      <xsl:when test="$len &gt; 1">
        <xsl:variable name="halflen" select="round($len div 2)"/>
        <!-- no "left" and "right" variables necessary! -->
        <xsl:call-template name="escape-text">
          <xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
        </xsl:call-template>
        <xsl:call-template name="escape-text">
          <xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:choose>
          <xsl:when test="not(contains($vChars, $s))">
            <xsl:value-of select="$s"/>
          </xsl:when>
          <xsl:when test="contains('\&quot;', $s)">
            <xsl:value-of select="concat('&quot;\', $s, '&quot;')" />
          </xsl:when>
          <!-- all other cases can be collapsed, this saves some time -->
          <xsl:otherwise>
            <xsl:value-of select="concat('&quot;', $s, '&quot;')" />
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

Should be another tiny bit faster, but I have not benchmarked it. In any case, it's shorter. ;-)

Tomalak 2010-08-24 18:52:23

@Tomalak: Also there is a `"` case that sould be `"\""`. And I think that you have a typo in `$len gt; 1`. As we talk before, that could be `$len > 1`

Alejandro 2010-08-24 19:14:08

@Alejandro: Yeah, that's a typo. I'll fix it. Thanks for the hint with the extra case, too.

Tomalak 2010-08-24 19:22:20

+1: Thanks for your response! The idea of not even needing `left` and `right` is great, I'm stunned this didn't occur to me. Also, collapsing the `xsl:choose` is a nice idea; even if it wouldn't help performance at all, it needs less code and doesn't sacrifice readability.

Frerich Raabe 2010-08-24 19:36:56

Two questions regarding this answer: if you know that `$s` is just one character long, why is a `contains()` call faster than comparing strings using the `=` operator? Also, why is a single `concat()` call better than two `<xsl:text>"</xsl:text>` elements with an `<xsl:value-of select="$s"/>` in between? It would be great if you could elaborate why this version is so much faster than mine. :-)

Frerich Raabe 2010-08-24 19:47:06

I used `contains()` primarily to keep the code short. It is possible that two separate checks against string literals are faster, but the complexity of the function-based check is fairly low, so speed differences should become negligible and can be sacrificed for readability/maintainability. The same goes for `concat()` - actual performance diferences may be much less then you think. At any rate, benchmark it and decide for yourself. If you do, I'd be interested in your findings!

Tomalak 2010-08-24 19:58:54

I think that without left and right variable, this should be faster, because the differences between RTF to string conversion and adjacent text nodes serialization.

Alejandro 2010-08-24 20:36:19

@Alejandro: I'm counting this as a virtual up-vote. ;-)

Tomalak 2010-08-24 20:39:51

@Tomalak, @Frerich-Raabe, @Alejandro: Yes, this is an obvious refactoring. Other things to try: 1. I believe that having a series of `<xsl:value-of>` is faster than `<xsl:value-of select="concat(...)"/>` 2. In `$vChars` also order the chars according to their decreasing frequency.

Dimitre Novatchev 2010-08-24 20:54:02

@Tomalak, @Frerich-Raabe, @Alejandro: I still have one more to try in my sleeve -- but after the workday is over -- in 5 hrs.

Dimitre Novatchev 2010-08-24 20:56:14

@Dimitre: work day was over here 5 hours ago! Time zones are a bitch. ;-) (PS: I seriously enjoy the discussions that unfold in a good question. Shame that there are so many bad questions.)

Tomalak 2010-08-24 21:06:03

@Tomalak: +1 Now is not virtual! ;) Here we have 2 more hours to go.

Alejandro 2010-08-24 21:07:28

@Tomalak: I enjoy the discussion, too. @Alejandro: You are on the East coast, but I'm on the West coast.

Dimitre Novatchev 2010-08-24 21:21:42

@Alejandro, "this should be faster, because the differences between RTF to string conversion and adjacent text nodes serialization" - why would you expect the former to be faster than the latter? Both involve concatenation of string, I would assume...

LarsH 2010-08-25 19:32:07

@LarsH: in the posted stylesheet `$left` and `$rigth` are Result Tree Fragments. The call `concat($left,$right)` is the same as `concat(string($left),string($right))`. So, even `<xsl:copy-of select="$left"/><xsl:copy-of select="$rigth"/>` should be faster.

Alejandro 2010-08-25 21:11:03

@Alejandro, OK, I get it now. I missed the fact that $left and $right were RTFs. So with `concat($left,$right)`, they get converted from RTF to string, then string-concatenated, then made into a new RTF. Whereas with the two adjacent <call-template> instructions, two RTFs are concatenated into an RTF. That makes sense.

LarsH 2010-08-25 22:16:09

@LarsH: With those `call-templates` the result is the instantiated content of the named template: the text nodes output by `value-of`. So you end up with adjacent text nodes. Then, in serialization face those are just merged.

Alejandro 2010-08-25 23:06:09

Answer 4

+7 A:

Another (complementary) strategy would be to terminate the recursion early, before the string length is down to 1, if the condition translate($s, $vChars, '') = $s is true. This should give much faster processing of strings that contain no special characters at all, which is probably the majority of them. Of course the results will depend on how efficient xsltproc's implementation of translate() is.

Michael Kay 2010-08-24 20:38:56

+1 Very good reasoning. PS: I just noticed that I gave the godfather of XSLT his very first up-vote on SO! Great to see you here! :)

Tomalak 2010-08-24 20:46:56

+1 The reasoning is sound; implementing this trick brought the runtime down from 5812ms (this is after implementing Tomalaks and Dimitre's suggestions) to 657ms. I was so surprised about this that I had to diff the output with the previous output.

Frerich Raabe 2010-08-24 20:54:08

@Dr. Kay: +1 5812ms to 657ms really shows an improvement!

Alejandro 2010-08-24 21:06:52

@Tomalak - is there a badge for that? :-) There should be!

LarsH 2010-08-25 16:40:40

Wow... Welcome to SO, Dr. Kay! And of course, your answers deserve to be known. +1.

Dimitre Novatchev 2010-08-26 03:17:04

@Michael-Kay, @LarsH, @Tomalak, @Alejandro, @Frerich-Raabe, @Wilfred-Springer: Who has the fastest solution so far? please send your solutions to me (dnovatchev on the google's mail) and your data files -- I think I could speed up the fastest solution even further.

Dimitre Novatchev 2010-08-26 03:31:39

Accepting this answer since it caused the biggest improvement in runtime speed while requiring the smallest amount of code change. I wished I could accept more than one answer.

Frerich Raabe 2010-09-08 12:18:54

Answer 5

+2 A:

For what it's worth, here's my current version of the escape-text template which incorporates most of the (excellent!) suggestions which people have given in response to my question. For the record, my original version took about 45605ms on average on my sample DocBook document. After that, the runtime was decreased in multiple steps:

Removing the left and right variable together with the concat() call brought the runtime down to 13052ms; this optimization was taken from Tomalak's answer.
Moving the common case (which is: the given character doesn't need any special escaping) first in the inner <xsl:choose> element brought the runtime further down to 5812ms. This optimization was first suggested by Dimitre.
Aborting the recursion early by first testing whether the given string contains any of the special characters at all brought the runtime down to 612ms. This optimization was suggested by Michael.
Finally, I couldn't resist doing a micro optimization after reading a comment by Dimitre in Tomalak's answer: I replaced the <xsl:value-of select="concat('x', $s, 'y')"/> calls with <xsl:text>x</xsl:text><xsl:value-of select="$s"/><xsl:text>y</xsl:text>. This brought the runtime to about 606ms (so about 1% improvement).

In the end, the function took 606ms instead of 45605ms. Impressive!

<xsl:variable name="specialLoutChars">"@|#\}&amp;^~/{</xsl:variable>

<xsl:template name="escape-text">
    <xsl:param name="s" select="."/>
    <xsl:param name="len" select="string-length($s)"/>
    <xsl:choose>
        <!-- Common case optimization: 
             no need to recurse if there are no special characters -->
        <xsl:when test="translate($s, $specialLoutChars, '') = $s">
            <xsl:value-of select="$s"/>
        </xsl:when>
        <!-- String length greater than 1, use DVC pattern -->
        <xsl:when test="$len > 1">
            <xsl:variable name="halflen" select="round($len div 2)"/>
            <xsl:call-template name="escape-text">
                <xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
                <xsl:with-param name="len" select="$halflen"/>
            </xsl:call-template>
            <xsl:call-template name="escape-text">
                <xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
                <xsl:with-param name="len" select="$len - $halflen"/>
            </xsl:call-template>
        </xsl:when>
        <!-- Special character -->
        <xsl:otherwise>
            <xsl:text>&quot;</xsl:text>
            <!-- Backslash and quot need backslash escape -->
            <xsl:if test="$s = '&quot;' or $s = '\'">
                <xsl:text>\</xsl:text>
            </xsl:if>
            <xsl:value-of select="$s"/>
            <xsl:text>&quot;</xsl:text>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>

Frerich Raabe 2010-08-24 21:19:30

@Frerich-Raabe: 76 times speedup -- not bad, is it? :)

Dimitre Novatchev 2010-08-24 21:24:15

@Frerich-Raabe: On my test files this transformation is 1.45 times faster than the solution in my answer -- I believe we can do even better.

Dimitre Novatchev 2010-08-24 21:29:22

@Dimitre: It's amazing, yes. :-)

Frerich Raabe 2010-08-24 21:29:38

@Frerich Raabe: I think you could refactor this answer in a more compact code with just one `choose` and test in this order: `translate...`, `$len > 1`, quot and backslash test, otherwise.

Alejandro 2010-08-25 21:27:57

@Frerich: Congratulations! This has been one of the more entertaining (and satisfying!) threads on this site. Also, seeing the 1% net improvement for the "string literal" vs. "string function" change made me smile. ;)

Tomalak 2010-08-25 21:58:15

Answer 6

A:

After @Frerich-Raabe published a community wiki answer which combines the suggestions so far and achieves (on his data) a speedup of 76 times -- big congratulations to everybody!!!

I couldn't resist not to go further:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;
 <xsl:variable name="specialLoutChars">"@|#\}&amp;^~/{</xsl:variable>

 <xsl:key name="kTextBySpecChars" match="text()"
  use="string-length(translate(., '&quot;@|#\}&amp;^~/', '') = string-length(.))"/>

 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <xsl:template match="text()[key('kTextBySpecChars', 'true')]" name="escape-text">
  <xsl:param name="s" select="."/>
  <xsl:param name="len" select="string-length($s)"/>

  <xsl:choose>
    <xsl:when test="$len >= 2">
        <xsl:variable name="halflen" select="round($len div 2)"/>
        <xsl:call-template name="escape-text">
            <xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
            <xsl:with-param name="len" select="$halflen"/>
        </xsl:call-template>
        <xsl:call-template name="escape-text">
            <xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
            <xsl:with-param name="len" select="$len - $halflen"/>
        </xsl:call-template>
    </xsl:when>
    <xsl:when test="$len = 1">
        <xsl:choose>
            <!-- Common case: the character at hand needs no escaping at all -->
            <xsl:when test="not(contains($specialLoutChars, $s))">
                <xsl:value-of select="$s"/>
            </xsl:when>
            <xsl:when test="$s = '&quot;' or $s = '\'">
                <xsl:text>&quot;\</xsl:text>
                <xsl:value-of select="$s"/>
                <xsl:text>&quot;</xsl:text>
            </xsl:when>
            <xsl:otherwise>
                <xsl:text>&quot;</xsl:text>
                <xsl:value-of select="$s"/>
                <xsl:text>&quot;</xsl:text>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:when>
  </xsl:choose>
 </xsl:template>
</xsl:stylesheet>

This transformation achieves (on my data) a further speedup of 1.5 times. So the total speedup should be more than 100 times.

Dimitre Novatchev 2010-08-24 21:39:02

Thanks a lot for your efforts! Unfortunately using `xsl:key` here doesn't improve the situation for me, the contrary is the case. That aside, couldn't you have used `$specialLoutChars` in the `kTextBySpecChars` key? That way it wouldn't lack the "{" character either. Should `xsl:key` "compile" the given XPath, so that you can use it more quickly? I'm curious what the reasoning behind this change is.

Frerich Raabe 2010-08-24 22:33:45

@Frerich-Raabe: THat means that your data is very different from mine. The key will help if you have many text nodes, you are escaping all (or most) of them and some of them don't contain special characters. In XSLT 1.0 an `<xsl:key>` definition isn't allowed to contain variable references: http://www.w3.org/TR/xslt#key "*It is an error for the value of either the use attribute or the match attribute to contain a VariableReference*."

Dimitre Novatchev 2010-08-24 23:04:44

@Frerich-Raabe, since we're doing so much benchmarking, can you post a sample data set somewhere (maybe @Dimitre too) so we can compare apples to apples?

LarsH 2010-08-25 19:01:23

@Dimitre, check out my new answer with an almost 3x speedup! :-)

LarsH 2010-08-26 02:54:44

@LarsH: Unfortunately I can't; I didn't forsee this benchmarking, so I always only created timings with the internal software manuals of our company. I wished I had started using some freely available test data from the beginning. :-/

Frerich Raabe 2010-08-26 09:16:53

@Frerich-Raabe: Just `translate(.,'abcdefghijklmnopqrstuvwxyz','Z')` then you'd be able to give us the result. :) THe timings should be the same for the original document and the one you'll give us.

Dimitre Novatchev 2010-08-26 15:31:27

@Dimitre: This is not working for me. Two things: the key predicate has nothing related to the matched text node, just test for any text node with 'true' as key... Plus, it looses the stop-recursion-early test.

Alejandro 2010-08-26 18:02:01

@Alejandro: Good observation -- the value for the `key()` function must be `'false'`.

Dimitre Novatchev 2010-08-26 18:32:38

@Dimitre: Do you know how to set Saxon in Xselerator to give timing? Today I can't find Saxon parameters documentation...

Alejandro 2010-08-26 19:12:16

@Alejandro: Mine (for Saxon 9) are: -Xms1024M -Xmx1024M -jar C:\Xml\Parsers\Saxon\9.0.0.4\J\saxon9.jar -t -repeat:3 -o %out% %xml% %xsl% %param[ name="value"]%

Dimitre Novatchev 2010-08-26 19:31:00

@Alejandro: Mine (for Saxon 6.5.5) are: %xml% %xsl% %out%%param[ name="value"]% and I am starting Saxon.bat: @echo off"C:\Program Files\Java\jre1.6.0_04\bin\java" com.icl.saxon.StyleSheet -t %1 %2 > %3pause

Dimitre Novatchev 2010-08-26 19:34:46

@Dimitre: Thanks!

Alejandro 2010-08-26 19:36:24

@Dimitre - I understand from @Alejandro's earlier comment that `match="text()[key('kTextBySpecChars', 'true')]"` will match *all* text nodes if *any* text nodes have that key. Right? So this template is not working as intended? I'm not sure how you _would_ make it work, while still taking advantage of the key...

LarsH 2010-08-26 22:30:44

@LarsH: `match="text()[key('kTextBySpecChars', false)]"` matches any text() nodes contain special characters. Any other text nodes will be just copied by the identity transform. I suspect that delegating the first check to the xsl:key may be faster than doing it afterwards. What is unfinished here is causing the check for special characters to be skipped (but only the first time) -- I am not sure if this deserves the effort. Of course, I can't explain why I got 1.5 times speedup. I need the fastest algorithm so far and agreed upon data file and then I can try to do some additional improvement.

Dimitre Novatchev 2010-08-26 23:18:27

Answer 7

A:

OK, I'll chip in. Though not as interesting as optimizing the XSLT 1.0 version, you did say that XSLT 2.0 solutions are welcome, so here's mine.

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;

    <xsl:template name="escape-text" match="text()" priority="2">
        <xsl:variable name="regex1">[@|#}&amp;^~/{]</xsl:variable>
        <xsl:variable name="replace1">"$0"</xsl:variable>
        <xsl:variable name="regex2">["\\]</xsl:variable>
        <xsl:variable name="replace2">"\\$0"</xsl:variable>
        <xsl:value-of select='replace(replace(., $regex2, $replace2),
                              $regex1, $replace1)'/>
    </xsl:template>

    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

This just uses a regexp replace() to replace \ or " with "\" or "\"" respectively; composed with another regexp replace() to surround any of the other escapable characters with quotes.

In my tests, this performs worse than Dimitre's most recent XSLT 1.0 offering, by a factor of more than 2. (But I made up my own test data, and other conditions may be idiosyncratic, so I'd like to know what results others get.)

Why the slower performance? I can only guess it's because searching for regular expressions is slower than searching for fixed strings.

Update: using analyze-string

As per @Alejandro's suggestion, here it is using analyze-string:

<xsl:template name="escape-text" match="text()" priority="2">
    <xsl:analyze-string select="." regex='([@|#}}&amp;^~/{{])|(["\\])'>
        <xsl:matching-substring>
            <xsl:choose>
                <xsl:when test="regex-group(1)">"<xsl:value-of select="."/>"</xsl:when>
                <xsl:otherwise>"\<xsl:value-of select="."/>"</xsl:otherwise>
            </xsl:choose>
        </xsl:matching-substring>
        <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

While this seems like a good idea, unfortunately it does not give us a performance win: In my setup, it consistently takes about 14 seconds to complete, versus 1 - 1.4 sec for the replace() template above. Call that a 10-14x slowdown. :-( This suggests to me that breaking and concatenating lots of big strings at the XSLT level is a lot more expensive than traversing a big string twice in a built-in function.

LarsH 2010-08-25 19:27:04

@LarsH: You are transversing the string twice.

Alejandro 2010-08-26 15:03:45

Alejandro 2010-08-26 15:18:25

@Alejandro, good point about traversing the string twice. Still I thought that would be faster than splitting and concatenating so many strings in high-level XSLT. I will try your analyze-string suggestion.

LarsH 2010-08-26 15:45:56

@Alejandro see my update to this answer, regarding analyze-string.

LarsH 2010-08-26 17:16:24

@LarsH: That's extrange! Testing with Altova (I couldn't set rigth Saxon for benchmarking), the `analize-string` solutions run 50% faster than `replace` solution.

Alejandro 2010-08-26 18:07:58

@Alejandro: that *is* strange... were you using the same code as me? including an identity template? I wonder if Saxon's analyze-string() is especially slow, or Altova's replace() is slow, or... We need somebody to build a table of performance times by algorithm and processor, using the same data set and the same machine.

LarsH 2010-08-26 19:08:30

@LarsH: I've copied and pasted your code.

Alejandro 2010-08-26 19:13:32

@LarsH: Thanks to Dimitre, now I can test with Saxon: `analize-string` solution is 5% faster than `replace` solution.

Alejandro 2010-08-26 19:43:28

@Alejandro ... ok ... What version of Saxon? e.g. .NET? I'm using Java version of Saxon 9.2.0.6 (HE and PE). Are you using the same input data as me, and if not, what data?

LarsH 2010-08-26 20:06:09

@LarsH: Same Java version here. Testing with your own provided data at http://www.huttar.net/lars-kathy/tmp/so3558407.zip

Alejandro 2010-08-26 20:15:02

Answer 8

+1 A:

Update: I fixed this to actually work; now, it is not a speedup!

Building off @Wilfred's answer...

After fiddling with the EXSLT replace() function, I decided it was interesting enough to post another answer, even if it's not useful to the OP. It may well be useful to others.

It's interesting because of the algorithm: instead of the main algorithm worked on here (doing a binary recursive search, dividing in half at each recursion, pruned whenever a 2^nth substring has no special characters in it, and iterating over a choice of special characters when a length=1 string does contain a special character), Jeni Tennison's EXSLT algorithm puts the iteration over a set of search strings on the outside loop. Therefore on the inside of the loop, it is only searching for one string at a time, and can use substring-before()/substring-after() to divide the string, instead of blindly dividing in half.

[Deprecated: I guess that's enough to speed it up significantly. My tests show a speedup of 2.94x over @Dimitre's most recent one (avg. 230ms vs. 676ms).] I was testing using Saxon 6.5.5 in the Oxygen XML profiler. As input I used a 7MB XML document that was mostly a single text node, created from web pages about javascript, repeated. It sounds to me like that is representative of the task that the OP was trying to optimize. I'd be interested to see hear what results others get, with their test data and environments.

Dependencies

This uses an XSLT implementation of replace which relies on exsl:node-set(). It looks like xsltproc supports this extension function (possibly an early version of it). So this may work out-of-the-box for you, @Frerich; and for other processors, as it did with Saxon.

However if we want 100% pure XSLT 1.0, I think it would not be too hard to modify this replace template to work without exsl:node-set(), as long as the 2nd and 3rd params are passed in as nodesets, not RTFs.

Here is the code I used, which calls the replace template. Most of the length is taken up with the verbose way I created search/replace nodesets... that could probably be shortened. (But you can't make the search or replace nodes attributes, as the replace template is currently written. You'll get an error about trying to put attributes under the document element.)

<xsl:stylesheet version="1.0" xmlns:str="http://exslt.org/strings"
    xmlns:foo="http://www.foo.net/something" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"&gt;
    <xsl:import href="lars.replace.template.xsl"/>

    <foo:replacements>
        <replacement>
            <search>"</search>
            <replace>"\""</replace>
        </replacement>
        <replacement>
            <search>\</search>
            <replace>"\\"</replace>
        </replacement>
        <replacement>
            <search>@</search>
            <replace>"["</replace>
        </replacement>
        <replacement>
            <search>|</search>
            <replace>"["</replace>
        </replacement>
        <replacement>
            <search>#</search>
            <replace>"["</replace>
        </replacement>
        <replacement>
            <search>}</search>
            <replace>"}"</replace>
        </replacement>
        <replacement>
            <search>&amp;</search>
            <replace>"&amp;"</replace>
        </replacement>
        <replacement>
            <search>^</search>
            <replace>"^"</replace>
        </replacement>
        <replacement>
            <search>~</search>
            <replace>"~"</replace>
        </replacement>
        <replacement>
            <search>/</search>
            <replace>"/"</replace>
        </replacement>
        <replacement>
            <search>{</search>
            <replace>"{"</replace>
        </replacement>
    </foo:replacements>

    <xsl:template name="escape-text" match="text()" priority="2">
        <xsl:call-template name="str:replace">
            <xsl:with-param name="string" select="."/>
            <xsl:with-param name="search"
                select="document('')/*/foo:replacements/replacement/search/text()"/>
            <xsl:with-param name="replace"
                select="document('')/*/foo:replacements/replacement/replace/text()"/>
        </xsl:call-template>
    </xsl:template>

    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

The imported stylesheet was originally this one.

However, as @Frerich pointed out, that never gave the correct output! That ought to teach me not to post performance figures without checking for correctness!

I can see in a debugger where it's going wrong, but I don't know whether the EXSLT template never worked, or if it just doesn't work in Saxon 6.5.5... either option would be surprising.

In any case, EXSLT's str:replace() is specified to do more than we need, so I modified it so as to

require that the input parameters are already nodesets
as a consequence, not require exsl:node-set()
not sort the search strings by length (they're all one character, in this application)
not insert a replacement string between every pair of characters when the corresponding search string is empty

Here is the modified replace template:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:str="http://exslt.org/strings"&gt;
   <!-- By Lars Huttar
    based on implementation of EXSL str:replace() by Jenni Tennison.
    http://www.exslt.org/str/functions/replace/str.replace.template.xsl
    Modified by Lars not to need exsl:node-set(), not to bother sorting
    search strings by length (in our application, all the search strings are of
    length 1), and not to put replacements between every other character
    when a search string is length zero.
    Search and replace parameters must both be nodesets.
    -->

   <xsl:template name="str:replace">
      <xsl:param name="string" select="''" />
      <xsl:param name="search" select="/.." />
      <xsl:param name="replace" select="/.." />
      <xsl:choose>
         <xsl:when test="not($string)" />
         <xsl:when test="not($search)">
            <xsl:value-of select="$string" />
         </xsl:when>
         <xsl:otherwise>
            <xsl:variable name="search1" select="$search[1]" />
            <xsl:variable name="replace1" select="$replace[1]" />

            <xsl:choose>
               <xsl:when test="contains($string, $search1)">
                  <xsl:call-template name="str:replace">
                     <xsl:with-param name="string"
                        select="substring-before($string, $search1)" />
                     <xsl:with-param name="search"
                        select="$search[position() > 1]" />
                     <xsl:with-param name="replace"
                        select="$replace[position() > 1]" />
                  </xsl:call-template>
                  <xsl:value-of select="$replace1" />
                  <xsl:call-template name="str:replace">
                     <xsl:with-param name="string"
                        select="substring-after($string, $search)" />
                     <xsl:with-param name="search" select="$search" />
                     <xsl:with-param name="replace" select="$replace" />
                  </xsl:call-template>
               </xsl:when>
               <xsl:otherwise>
                  <xsl:call-template name="str:replace">
                     <xsl:with-param name="string" select="$string" />
                     <xsl:with-param name="search"
                        select="$search[position() > 1]" />
                     <xsl:with-param name="replace"
                        select="$replace[position() > 1]" />
                  </xsl:call-template>
               </xsl:otherwise>
            </xsl:choose>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>

</xsl:stylesheet>

One of the side benefits of this simpler template is that you could now use attributes for the nodes of your search and replace parameters. This would make the <foo:replacements> data more compact and easier to read IMO.

Performance: With this revised template, the job gets done in about 2.5s, vs. my 0.68s for my recent tests of the leading competitor, @Dimitre's XSLT 1.0 stylesheet. So it's not a speedup. But again, others have had very different test results than I have, so I'd like to hear what others get with this stylesheet.

LarsH 2010-08-26 02:52:21

That's Excellent (+1). Could you, please, send me your data file? Just a single unit and also provide the repetition factor. I think I can improve further on based on your solution... :)

Dimitre Novatchev 2010-08-26 03:13:15

@LarsH: Thanks for your efforts! Interestingly, I recently decided to use a `<replacements>` data table as well, so that the string mapping is actually data to the function and not encoded into an `<xsl:choose>`. This made the function quite a bit slower (on one document I currently have at hand it went from 3201ms to 3779ms) but IMHO it's easier to read and to extend.

Frerich Raabe 2010-08-26 09:35:26

@LarsH: I must be doing something wrong here; unfortunately, your function is a *lot* slower on my sample document (the one I gave in my answer needs 3201ms, yours needed 88921ms; maybe it's xsltproc to blame, but in any case, your script didn't yield the same output as the other scripts either. It didn't seem to actually replace any of the special characters. I'm sure it's a mistake on my side.

Frerich Raabe 2010-08-26 09:43:16

@Dimitre, thanks. :-) I've uploaded the data file to http://www.huttar.net/lars-kathy/tmp/so3558407.zip @Frerich, wow, that's a horrible slowdown. I'll try running this script with xsltproc and see if I get the same results. Saxon does have some clever optimizations and maybe that's part of the performance difference we're seeing.

LarsH 2010-08-26 11:27:17

Strange, this morning when I try to run the various implementations, only my XSLT 2.0 version works properly... @Dimitre's most recent XSLT 1.0 version and this EXSL version fail to replace anything. I'll be working on it...

LarsH 2010-08-26 11:48:48

ansaurus

tags:

views:

answers:

How can I speed up my 'divide and conquer' XSLT template which replaces certain characters in a string?

Dependencies

related questions