I'd like to trim the leading whitespace inside p tags in XML, so this:

<p>  Hey, <em>italics</em> and <em>italics</em>!</p>

Becomes this:

<p>Hey, <em>italics</em> and <em>italics</em>!</p>

(Trimming trailing whitespace won't hurt, but it's not mandatory.)

Now, I know normalize-space() is supposed to do this, but if I try to apply it to the text nodes...

<xsl:template match="text()">
  <xsl:text>[</xsl:text>
  <xsl:value-of select="normalize-space(.)"/>
  <xsl:text>]</xsl:text>
</xsl:template>

...it's applied to each text node (in brackets) individually and sucks them dry:

[Hey,]<em>[italics]</em>[and]<em>[italics]</em>[!]

My XSLT looks basically like this:

<xsl:template match="p">
    <xsl:apply-templates/>
</xsl:template>

So is there any way I can let apply-templates complete and then run normalize-space on the output, which should do the right thing?

+2  A: 

You want:

 <xsl:template match="text()">
  <xsl:value-of select=
   "substring(
       substring(normalize-space(concat('[',.,']')),2),
       1,
       string-length(.)
              )"/>
 </xsl:template>

This wraps the string in "[]", then performs normalize-space(), then finally removes the wrapping characters.
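One caveat: `string-length(.)` measures the original text, so whenever normalization removes at least one character the outer `substring()` also keeps the closing `]`. A minimal sketch of a variant that measures the normalized string instead (the variable name `vWrapped` is made up for illustration):

```xml
<xsl:template match="text()">
  <!-- Wrap in brackets so normalize-space() cannot strip the edges entirely -->
  <xsl:variable name="vWrapped"
                select="normalize-space(concat('[', ., ']'))"/>
  <!-- Drop the first and last character, i.e. the two brackets -->
  <xsl:value-of select="substring($vWrapped, 2, string-length($vWrapped) - 2)"/>
</xsl:template>
```

Note that this collapses each run of whitespace to a single space but still keeps one space at either edge of the text node, so on its own it does not remove the leading space in the `p` example from the question.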

Dimitre Novatchev
@Dimitre Novatchev - I believe the square brackets were used to demonstrate what it is currently doing (stripping out leading and trailing whitespace from each text node). This doesn't achieve the desired output (which hasn't been clearly stated).
Mads Hansen
@Mads Hansen: If the wrapping characters are just for illustrative purposes, which seems likely, then they can be removed after applying `normalize-space()`. I updated my answer to do exactly this and I think this is what the OP wants. This is the only answer so far that normalizes the internal whitespace in a text node.
Dimitre Novatchev
Interesting idea, but I'm afraid it doesn't actually work -- I get "Hey, ]italics and italics!" when I try? But +1 for the helpful comments to the other answers.
jpatokal
+2  A: 

I would do something like this:

<xsl:template match="p">
    <xsl:apply-templates/>
</xsl:template>

<!-- strip leading whitespace -->
<xsl:template match="p/node()[1][self::text()]">
  <xsl:call-template name="left-trim">
     <xsl:with-param name="s" select="."/>
  </xsl:call-template>
</xsl:template>

This will strip left space from the initial node child of a <p> element, if it is a text node. It will not strip space from the first text node child, if it is not the first node child. E.g. in

<p><em>Hey</em> there</p>

I intentionally avoid stripping the space from the front of 'there', because that would make the words run together when rendered in a browser. If you did want to strip that space, change the match pattern to

match="p/text()[1]"

If you also want to strip trailing whitespace, as your title possibly implies, add these two templates:

<!-- strip trailing whitespace -->
<xsl:template match="p/node()[last()][self::text()]">
  <xsl:call-template name="right-trim">
     <xsl:with-param name="s" select="."/>
  </xsl:call-template>
</xsl:template>

<!-- strip leading/trailing whitespace on sole text node -->
<xsl:template match="p/node()[position() = 1 and
                              position() = last()][self::text()]"
              priority="2">
   <xsl:value-of select="normalize-space(.)"/>
</xsl:template>

The definitions of the left-trim and right-trim templates are at Trim Template for XSLT (untested). They might be slow for documents with lots of <p>s. If you can use XSLT 2.0, you can replace the call-templates with

  <xsl:value-of select="replace(.,'^\s+','')" />

and

  <xsl:value-of select="replace(.,'\s+$','')" />

(Thanks to Priscilla Walmsley.)
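In case the linked trim templates are unavailable: the sketch below is one possible XSLT 1.0 `left-trim`, not the definition from that page. It assumes the `s` parameter used in the calls above and removes leading whitespace one character at a time.

```xml
<xsl:template name="left-trim">
  <xsl:param name="s"/>
  <xsl:choose>
    <!-- A whitespace character normalizes to the empty string;
         the non-empty check stops the recursion once $s is used up -->
    <xsl:when test="$s != '' and normalize-space(substring($s, 1, 1)) = ''">
      <xsl:call-template name="left-trim">
        <xsl:with-param name="s" select="substring($s, 2)"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$s"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```

It recurses once per leading whitespace character, so a very long run of leading whitespace may be a problem on processors with a shallow recursion limit.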

LarsH
+1 I don't think it achieves exactly what @jpatokal wants, but it hasn't been stated very clearly. This provides all the information needed to trim the leading space from `p/text()[1]`, which is what I think is wanted.
Mads Hansen
@LarsH: Good answer. I think you want not `p/node()[1][self::text()]` but `p/node()[self::text()][1]` instead. The same for the last text node.
Dimitre Novatchev
@Dimitre: wouldn't that either (a) yield the first/last text node, regardless of whether they were "outside" any non-text children; or (b) do the same as what I had? Please explain further, as I would like to understand this better.
LarsH
@Mads, I don't think he wants to trim the leading space from `p/text()[1]` if p/text()[1] is preceded by an element such as `a`, do you? But I agree, let @jpatokal clarify.
LarsH
@LarsH: `p/node()[1][self::text()]` means: the first node child of `p` but only if it is a text node. While what you want is: The first of all the text node children of `p`
Dimitre Novatchev
@Dimitre: in other words, the expression you suggest would do (a). However I believe the OP wants to strip "the first node child of p but only if it is a text node". E.g. in `<p><em>Hey</em> there</p>` we should not strip the ' ' before 'there', because then it would be rendered with no space between 'Hey' and 'there'. But maybe @jpatokal will clarify.
LarsH
@LarsH: I believe your solution leaves the spaces in the first text node of: `<p><em> Hey </em> there</p>`, while the OP wants them stripped-off. This is due to the fact that you are processing the first child node only if it is also a text node and in this case `em` is not a text node but is the first child node.
Dimitre Novatchev
@Dimitre: I agree with you on what my code does, which is what I understand "from parent element only" to mean. (Note that your suggested expression, `p/node()[self::text()][1]`, does not strip the first space from your example `<p><em> Hey </em> there</p>` either.) I guess we disagree on what the OP wants. Given that you and @Alej have taken good and somewhat different stabs at what you believe @jpatokal wants, I won't spend more time on speculative solutions until/unless the OP clarifies.
LarsH
The basic requirement boils down to "never *start* with whitespace", even if it's wrapped in a few containing tags. Space between tags should not be stripped.
jpatokal
However, your solution (with the XSLT 2 replace) is not working for me, eg. plain <p>Foo</p> turns into null? Is $arg magic or is the definition for it missing?
jpatokal
@jpatokal: Sorry, `$arg` should be `.` in the `replace()` call. I'll edit this to fix.
LarsH
@jpatokal: 'Space between tags should not be stripped.' At face value that seems to contradict '"never start with whitespace", even if it's wrapped in a few containing tags.' Maybe you mean 'space between non-space text should not be stripped'?
LarsH
"Space between tags" = between the closing of one tag and the opening of another, eg. `<p/>[ here ]<p/>`. Not the same as `<p><em>[ this ]</em></p>`.
jpatokal
@jpatokal: ok. It would be helpful next time if you would define your requirements more accurately from the beginning, so we avoid wasting time implementing the wrong ones. However, I understand that sometimes, defining the requirements correctly is the biggest part of the problem.
LarsH
@jpatokal: Am I right in thinking you *do* want to strip space between the closing of one tag and the opening of another if there has not been any text yet? E.g. in `<p><span/> <em>Hi</em></p>`. In which case the defining requirement is as you said, the text should never start with whitespace, regardless of the level of embedding; space *after text* should not be stripped.
LarsH
No. "Never start with whitespace" -- if there's a complete tag before the whitespace, then it's not starting with whitespace.
jpatokal
@jpatokal - So "never start with whitespace" ignores multiple opening tags (despite the question title), but does not ignore complete or close tags. Why? I inferred that the goal was HTML that would not render an initial space; but apparently that's not it.
LarsH
A complete or close tag will (presumably) render something, so the HTML does not render an initial space. In other news, the horse is dead. Stop beating it and move on.
jpatokal
@jpatokal, `<span/>`, as in my example, does not render anything. You're free to stop responding whenever you like.
LarsH
@jpatokal: "Stop beating it and move on." I and others spent our time trying to help you at your request. Some of that time was in vain because your question was underspecified. We took the risk of trying to infer the specs and in some cases were wrong. Having spent that time, I'm still interested in nailing down a consistent definition of what the task actually was. Haven't seen it yet. If you're not interested in that, you move on, but rudeness on your part is unjustified.
LarsH
+4  A: 

This stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="p//text()[1][generate-id()=
                                      generate-id(ancestor::p[1]
                                                  /descendant::text()[1])]">
        <xsl:variable name="vFirstNotSpace"
                      select="substring(normalize-space(),1,1)"/>
        <xsl:value-of select="concat($vFirstNotSpace,
                                     substring-after(.,$vFirstNotSpace))"/>
    </xsl:template>
</xsl:stylesheet>

Output:

<p>Hey, <em>italics</em> and <em>italics</em>!</p>

Edit 2: Better expression (now only three function calls).

Edit 3: Matching the first descendant text node (not just the first node if it's a text node). Thanks to @Dimitre's comment.

Now, with this input:

<p><b>  Hey, </b><em>italics</em> and <em>italics</em>!</p>

Output:

<p><b>Hey, </b><em>italics</em> and <em>italics</em>!</p>
Alejandro
Wow. :-) I think I see what the nested substring() calls are doing, and it's much better than a recursive template. +1
LarsH
@Alejandro: I think you have the same issue as Lars: I think you want not `p/node()[1][self::text()]` but `p/node()[self::text()][1]` instead.
Dimitre Novatchev
@Dimitre: That would be the same as `p/text()[1]`, but I know what you mean.
Alejandro
+1 I think we have a winner! That is what I understand the desired output to be. Very nice solution.
Mads Hansen
@Alejandro: Not exactly, consider: `<p><em> Hello </em></p>`. This would be: `(p//text())[1]`
Dimitre Novatchev
@Alejandro: Oh, I see that you have fixed this. Good, +1
Dimitre Novatchev
Don't know about performance, but it could be just `text()[generate-id()=generate-id(ancestor::p[1]/descendant::text()[1])]`. Or with keys: `<xsl:key name="kIsPFirstDescendant" match="text()" use="generate-id(ancestor::p[1]/descendant::text()[1])"/><xsl:template match="text()[key('kIsPFirstDescendant',generate-id())]">...`
Alejandro
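Spelled out as a complete stylesheet, the key-based variant might look like the sketch below. It only combines the key and pattern from the comment above with the identity rule and trimming expression from the answer; whether it is actually faster than the original pattern has not been measured here.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- Index every text node by the id of the first text node
         under its nearest p ancestor -->
    <xsl:key name="kIsPFirstDescendant" match="text()"
             use="generate-id(ancestor::p[1]/descendant::text()[1])"/>

    <!-- Identity rule: copy everything else unchanged -->
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <!-- A text node whose own id occurs as a key value is the
         first text descendant of some p -->
    <xsl:template match="text()[key('kIsPFirstDescendant', generate-id())]">
        <xsl:variable name="vFirstNotSpace"
                      select="substring(normalize-space(), 1, 1)"/>
        <xsl:value-of select="concat($vFirstNotSpace,
                                     substring-after(., $vFirstNotSpace))"/>
    </xsl:template>
</xsl:stylesheet>
```

Text nodes outside any `p` all get the empty string as their key value, which never equals a generated id, so they fall through to the identity rule.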
A bit more complicated than I was hoping, but seems to work like a charm. Thanks!
jpatokal
@jpatokal: You're welcome. As an aside: complicated? An identity rule and just one other rule? The pattern is a bit complex because of the axis restrictions in patterns: you can't write `p/descendant::text()[1]` in a pattern, so I've reversed it.
Alejandro
I suppose it's simple by the insane standards of XSLT, but in any sensible programming language this would be `trim()` or `strip()`...
jpatokal
@jpatokal: You're mixing things up, I think... If you are referring to ltrim- or rtrim-style functions, I'll give you that. It looks like we can manage with just `fn:normalize-space()`. I think this XPath 1.0 expression isn't so bad: `substring-after(.,substring-before(.,substring(normalize-space(),1,1)))`, meaning *the string after the string that comes before the first non-whitespace character*. But as for the whole process (*copy everything as is, but left-trim whitespace for every text node that is the first descendant of a `p`*), I really don't think you could express it more compactly.
Alejandro
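For reference, here is the `substring-after`/`substring-before` expression from the last comment dropped into a stylesheet and split into steps. This is only a sketch: the variable name `vLeading` is invented for readability, and the match pattern is the simplified one suggested in an earlier comment.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- Identity rule: copy everything else unchanged -->
    <xsl:template match="node()|@*">
        <xsl:copy>
            <xsl:apply-templates select="node()|@*"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="text()[generate-id()=
                                generate-id(ancestor::p[1]/descendant::text()[1])]">
        <!-- First non-whitespace character, e.g. 'H' for '  Hey, ' -->
        <xsl:variable name="vFirstNotSpace"
                      select="substring(normalize-space(), 1, 1)"/>
        <!-- Everything before that character: the leading whitespace -->
        <xsl:variable name="vLeading"
                      select="substring-before(., $vFirstNotSpace)"/>
        <!-- Everything after the leading whitespace, e.g. 'Hey, ' -->
        <xsl:value-of select="substring-after(., $vLeading)"/>
    </xsl:template>
</xsl:stylesheet>
```

A text node that consists only of whitespace is passed through unchanged, because both variables are then empty and `substring-after(., '')` returns the whole string.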