I have a source XHTML document with elements in multiple namespaces that I am transforming into an HTML document (obviously with no namespaces). In my XSL templates I only match elements in the XHTML namespace to remove non-HTML-compatible elements from the result tree. However, in the output, while those elements are gone, the whitespace I used to indent them remains—i.e., lines of irrelevant CR/LFs and tabs.

For example, if this is my input:

<div id="container">
    <svg:svg>
        <svg:foreignObject>
            <img />
        </svg:foreignObject>
    </svg:svg>
</div>

After applying the transformation, this will be the output:

<div id="container">


            <img />


</div>

While my desired output is this:

<div id="container">
    <img />
</div>

This happens using both TransforMiiX (attaching the stylesheet locally in Firefox) and libxslt (attaching the stylesheet server-side with PHP), so I know it's probably the result of some XSL parameter not getting set, but I've tried playing with <xsl:output indent="yes|no" />, xml:space="default|preserve", <xsl:strip-space elements="foo bar|*" />, all to no avail.

This will be implemented server-side so if there's no way to do it in raw XSL but there is a way to do it in PHP I'll accept that.

I know this is not a namespace issue since I get the same result if I remove ANY element.

+1  A: 

The whitespace you see comes from the source document. XSLT's built-in template rules copy every text node to the output, whether or not it contains anything other than whitespace. To override the built-in rule for text nodes, include:

<xsl:template match="text()" />
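If the stylesheet relies on the identity template, an empty template for all text nodes also discards the text you want to keep. A narrower variant, sketched here on the assumption that only whitespace-only text nodes are unwanted, attaches a predicate to the pattern. A pattern with a predicate gets a default priority of 0.5, so it wins over the identity template (priority -0.5) for exactly those nodes:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Identity template: copies every node and attribute as-is. -->
    <xsl:template match="node() | @*">
        <xsl:copy>
            <xsl:apply-templates select="node() | @*" />
        </xsl:copy>
    </xsl:template>

    <!-- Whitespace-only text nodes: emit nothing.
         Text nodes with visible content still reach the identity template. -->
    <xsl:template match="text()[normalize-space() = '']" />

</xsl:stylesheet>
```

This leaves text with real content untouched, but it also removes whitespace-only nodes between adjacent inline elements, which may matter for rendering.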

Alternatively: Spot any <xsl:apply-templates /> (or <xsl:apply-templates select="node()" />) and explicitly specify which children you want to apply templates to. This method might be necessary if your transformation partly relies on the identity template (in which case the empty template for text nodes would be counter-productive).
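As a rough sketch of the explicit-selection approach: the stylesheet below assumes the HTML elements in the source are in the XHTML namespace (bound here to the prefix `h`, chosen only for illustration) and that elements in any other namespace should be unwrapped, keeping their children. Every `<xsl:apply-templates />` names what it descends into, so whitespace-only text nodes are never visited:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

    <xsl:output method="html" />

    <!-- XHTML elements: re-create them without a namespace,
         then visit only element children and non-blank text. -->
    <xsl:template match="h:*">
        <xsl:element name="{local-name()}">
            <xsl:copy-of select="@*" />
            <xsl:apply-templates select="* | text()[normalize-space() != '']" />
        </xsl:element>
    </xsl:template>

    <!-- Any other element: output nothing for the element itself,
         but still visit its children with the same explicit selection. -->
    <xsl:template match="*">
        <xsl:apply-templates select="* | text()[normalize-space() != '']" />
    </xsl:template>

</xsl:stylesheet>
```

The catch-all `match="*"` template is there because the built-in rule for elements applies templates to all children, including the whitespace text nodes inside the removed wrappers.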

I have marked up the "insignificant" white space in your snippet the way Word would do it:

<div id="container">¶
····<svg:svg>¶
········<svg:foreignObject>¶
············<img />¶
········</svg:foreignObject>¶
····</svg:svg>¶
</div>

EDIT: You can also modify your identity template like this:

<xsl:template match="node() | @*">
  <xsl:copy>
    <!-- select everything except blank text nodes -->
    <xsl:apply-templates select="
      node()[not(self::text())] | text()[normalize-space() != ''] | @*
    " />
  </xsl:copy>
</xsl:template>

This removes every whitespace-only text node (attribute values remain untouched; they are not text nodes). Use <xsl:output indent="yes" /> to pretty-print the result.

Tomalak
I am using the identity template—if I match `text()`, all of my content disappears. However, I'm not sure what you mean by the alternative; can you give me an example?
Hugh Guiney
@Hugh: If your stylesheet heavily relies on the identity template, I recommend @Josh Davis' approach. I've created a shorter and more correct variant of it (he uses an unconditional `<xsl:copy-of select="@*" />`, which is not ideal).
Tomalak
Thank you; I see what you mean. But unfortunately that did not work either. The result is all on one line, even with `<xsl:output indent="yes" />` set.
Hugh Guiney
Tomalak
You're right: looks like libxslt doesn't honor `indent="yes"` when `method="html"` is set. (PHP also has a bug where `$DOMDocument->formatOutput=true` doesn't have any effect on `$DOMDocument->saveHTML()`.) I have tried to format via Tidy, but that isn't working either; it puts everything on a new line but doesn't indent anything unless I set `indent=true`, which is not what I want since it also indents the contents of block-level elements. But I guess that's a separate issue.
Hugh Guiney
@Hugh: The point is that whitespace in HTML should be inherently insignificant. Presentation should not suffer from code layout (most of the time it does not, with the notable exception of adjacent inline elements). If you are in pursuit of nice source code format only, maybe you are taking it one step too far.
Tomalak
+1  A: 

You have two ways to achieve your desired result: either you fix your original transformation to handle whitespace differently, or you keep your transformation as-is and add a second pass to prettify the output. If your original transformation is complicated, I'd recommend the 2-pass approach. Making the transformation even more complicated invites corner cases where you don't get the desired results, which leads to more special-case handling and a real chance of adding bugs to something that used to work.

You should be able to ignore the whitespace nodes by testing them with normalize-space(). Here's what the second pass could look like. If you go with the 1-pass approach, the code will be roughly the same, I guess.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes" />

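    <!-- Keep a text node only if it contains non-whitespace characters.
         The explicit priority makes this template win over match="node()"
         below; both patterns otherwise share the default priority of -0.5. -->
    <xsl:template match="text()" priority="1">
        <xsl:if test="normalize-space(.) != ''">
            <xsl:value-of select="."/>
        </xsl:if>
    </xsl:template>

    <!-- Copy every other node along with its attributes, then recurse. -->
    <xsl:template match="node()">
        <xsl:copy>
            <xsl:copy-of select="@*" />
            <xsl:apply-templates />
        </xsl:copy>
    </xsl:template>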

</xsl:stylesheet>
Josh Davis
The first template, by itself, does ALMOST what I want, but it also removes all of the relevant indentation. But the second template puts the elements I've removed BACK into the result tree.
Hugh Guiney
Also, the parser choked on the expression `normalize-space(.) != \'\'`. If I un-escaped the single-quotes, it worked.
Hugh Guiney
@Hugh: That's because the code snippet shown here is part of a string definition in some programming language, not standalone XSLT (see, it even ends with `;`)
Tomalak
Ah. Thought that was a typo. Thanks.
Hugh Guiney
Sorry, you're right, it was a typo: I forgot to clean up the string after testing it in PHP. Originally I was going to post the short PHP snippet, but then I realized all you needed was the XSL and forgot to remove the escaped quotes. It's fixed now.
Josh Davis
Oh, and you're supposed to run that on *the result* of your first transformation. Not the source document.
Josh Davis