This is my problem: The code snippet below (inside the <xsl:choose>) does not reliably strip <p>, <div> or <br> tags out of a string using a combination of the substring-before() and substring() functions.

The string I'm trying to format is an attribute of a SharePoint SPS 2003 list item - text inputted via a rich text editor. What I ideally need is a catch-all <xsl:when> test that will always just grab the text within the string before a line break (effectively the first paragraph). I thought that:

<xsl:when test="contains(Story, '&#x0a;')='True'">

would do that, but it doesn't always work: although the rich text editor inserts <br> and <p> tags, it appears that these are not always represented by the &#x0a; value.

Please help - this is driving me nuts. Code:

<xsl:choose>
  <xsl:when test="contains(Story, '&#x0a;')">
    <div>PTAG_OPEN_OR_BR<xsl:value-of select="substring-before(Story,'&#x0a;')" disable-output-escaping="yes"/></div>
  </xsl:when>
  <xsl:when test="contains(Story, '&#x0a;') and contains(Story, 'div>')">
    <div>DTAG<xsl:value-of select="substring-before(substring-after(substring-before(Story, '/div>'), 'div>'),'&#x0a;')" disable-output-escaping="yes"/></div>
  </xsl:when>
  <xsl:when test="not(contains(Story, '&#x0a;')) and contains(Story, 'br>')">
    <div>BRTAG<xsl:value-of select="substring(Story, 1, string-length(substring-before(Story, 'br>')) - 1)" disable-output-escaping="yes"/></div>
  </xsl:when>
  <xsl:otherwise>
    <div>NO_TAG<xsl:value-of select="substring(Story, 1, 150)" disable-output-escaping="yes"/></div>
  </xsl:otherwise>
</xsl:choose>

EDIT:

Will try out your suggestion Tomalak. Thank you.

EDIT: 12/11/09

Only just had a chance to try this out. Thanks for your help, Tomalak. I have one question about rendering this as HTML rather than XML. When I call the template removeMarkup, I get the following error message:

Exception: System.Xml.XmlException Message: '<', hexadecimal value 0x3C, is an invalid attribute character. Line 120, position 58.

I'm not sure, but I believe this is because you can't have XSLT tags inside attribute values. Is there any way around this?

Thanks, Tim

A: 

A <p> or <br> is very probably represented by a <p> or <br> by the editor, not by &#x0a;. ;-)

Line break characters are not required anywhere in HTML, so if the editor decides not to include any line breaks, it's still fine. Relying on line breaks is an error on your part, IMHO.

Apart from that, without sample XML it is anybody's guess what XPath might do the trick for you.

EDIT:

I suggest a template that removes any HTML markup from a string (by recursive string processing). Then you can take the first meaningful bit of text from the result and print it out.

With this input:

<test>
  <Story>&lt;div&gt;&lt;p&gt;The quick brown fox jumped over the lazy dog&lt;/p&gt;&lt;p&gt;The quick brown fox jumped over the lazy dog&lt;/p&gt;&lt;/div&gt;</Story>
  <Story>&lt;div&gt;&lt;p&gt;The quick brown fox jumped over the lazy dog&lt;/p&gt;&lt;p&gt;The quick brown fox jumped over the lazy dog&lt;/p&gt;&lt;/div&gt;</Story>
  <Story>The quick brown fox jumped over the lazy dog.&lt;br&gt;The quick brown fox jumped over the lazy dog.</Story>
  <Story>The quick brown fox jumped over the lazy dog.</Story>
</test>

and this stylesheet:

<xsl:stylesheet
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>
  <xsl:output method="xml" encoding="utf-8" />

  <xsl:template match="Story">
    <xsl:copy>
      <original>
        <xsl:value-of select="." />
      </original>
      <processed>
        <xsl:variable name="result">
          <xsl:call-template name="removeMarkup">
            <xsl:with-param name="html" select="." />
          </xsl:call-template>
        </xsl:variable>
        <!-- select the bit of text before the '<>' delimiter -->
        <xsl:value-of select="substring-before($result, '&lt;&gt;')" />
      </processed>
    </xsl:copy>
  </xsl:template>

  <!-- this template removes all HTML markup (tags) from a string -->
  <xsl:template name="removeMarkup">
    <xsl:param name="html"  select="''" />
    <xsl:param name="inTag" select="false()" />

    <!-- if we are in a tag, we look for the next '>', otherwise for '<' -->    
    <xsl:variable name="lookFor">
      <xsl:choose>
        <xsl:when test="$inTag">&gt;</xsl:when>
        <xsl:otherwise>&lt;</xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <!-- split the input at the current delimiter char -->
    <xsl:variable name="head" select="substring-before(concat($html, '&lt;'), $lookFor)" />
    <xsl:variable name="tail" select="substring-after($html, $lookFor)" />

    <xsl:if test="not($inTag)">
      <xsl:value-of select="$head" />
      <!-- now add a unique delimiter after the first actual text -->
      <xsl:if test="translate(normalize-space($head), ' ', '') != ''">
        <xsl:value-of select="'&lt;&gt;'" /> <!-- '<>' as a delimiter -->
      </xsl:if>
    </xsl:if>

    <!-- remove markup for the rest of the string -->
    <xsl:if test="$tail != ''">
      <xsl:call-template name="removeMarkup">
        <xsl:with-param name="html"  select="$tail" />
        <xsl:with-param name="inTag" select="not($inTag)" />
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

the following result is produced:

<Story>
  <original>&lt;div&gt;&lt;p&gt;The quick brown fox jumped over the lazy dog&lt;/p&gt;&lt;p&gt;The quick brown fox jumped over the lazy dog&lt;/p&gt;&lt;/div&gt;</original>
  <processed>The quick brown fox jumped over the lazy dog</processed>
</Story>
<Story>
  <original>&lt;div&gt;&lt;p&gt;The quick brown fox jumped over the lazy dog&lt;/p&gt;&lt;p&gt;The quick brown fox jumped over the lazy dog&lt;/p&gt;&lt;/div&gt;</original>
  <processed>The quick brown fox jumped over the lazy dog</processed>
</Story>
<Story>
  <original>The quick brown fox jumped over the lazy dog.&lt;br&gt;The quick brown fox jumped over the lazy dog.</original>
  <processed>The quick brown fox jumped over the lazy dog.</processed>
</Story>
<Story>
  <original>The quick brown fox jumped over the lazy dog.</original>
  <processed>The quick brown fox jumped over the lazy dog.</processed>
</Story>

Disclaimer: As with all string processing over HTML input, this is not 100% foolproof, and certain malformed input can break it.
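
On the follow-up error ('<' is an invalid attribute character): a likely cause, assuming the <xsl:call-template> was written directly inside an attribute value, is that an XML attribute value cannot contain elements, so the stylesheet itself stops being well-formed. A minimal sketch of one way around it is to capture the template output in a variable and build the attribute with <xsl:attribute>. The variable names `plain` and `first` and the `title` attribute are hypothetical, chosen only for illustration.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html" encoding="utf-8" />

  <xsl:template match="Story">
    <!-- run removeMarkup once and keep its output in a variable -->
    <xsl:variable name="plain">
      <xsl:call-template name="removeMarkup">
        <xsl:with-param name="html" select="." />
      </xsl:call-template>
    </xsl:variable>
    <!-- text before the '<>' delimiter, i.e. the first piece of real text -->
    <xsl:variable name="first" select="substring-before($plain, '&lt;&gt;')" />

    <div>
      <!-- xsl:attribute builds the attribute from instructions;
           it has to come before any child content of the element.
           'title' is a hypothetical attribute used for illustration. -->
      <xsl:attribute name="title">
        <xsl:value-of select="$first" />
      </xsl:attribute>
      <xsl:value-of select="$first" />
    </div>
  </xsl:template>

  <!-- removeMarkup, repeated from the stylesheet above so this sketch runs on its own -->
  <xsl:template name="removeMarkup">
    <xsl:param name="html"  select="''" />
    <xsl:param name="inTag" select="false()" />

    <xsl:variable name="lookFor">
      <xsl:choose>
        <xsl:when test="$inTag">&gt;</xsl:when>
        <xsl:otherwise>&lt;</xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

    <xsl:variable name="head" select="substring-before(concat($html, '&lt;'), $lookFor)" />
    <xsl:variable name="tail" select="substring-after($html, $lookFor)" />

    <xsl:if test="not($inTag)">
      <xsl:value-of select="$head" />
      <xsl:if test="translate(normalize-space($head), ' ', '') != ''">
        <xsl:value-of select="'&lt;&gt;'" />
      </xsl:if>
    </xsl:if>

    <xsl:if test="$tail != ''">
      <xsl:call-template name="removeMarkup">
        <xsl:with-param name="html"  select="$tail" />
        <xsl:with-param name="inTag" select="not($inTag)" />
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Once the value sits in a variable, an attribute value template such as title="{$first}" on the literal <div> should work equally well; either form keeps XSLT elements out of the attribute value.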

Tomalak
Thanks for the response. I should have said that <p> and <br> tags don't always 'result in' a line break, rather than 'represented by'. As stated in a previous comment, I need to capture the text in the first paragraph of a string. Since I can't guarantee which HTML tag will be used to denote a paragraph, I need to use a line break, unless you can suggest an alternative?
A: 

contains() returns a boolean value, so contains(Story, '&#x0a;')='True' implies a type conversion. In XPath 1.0, when one operand of = is a boolean, the other operand is converted to a boolean as well, so the non-empty string 'True' becomes true() and the test happens to work. It still reads like a string comparison, though, and as a string comparison it would fail: string(true()) returns 'true', not 'True'.

Anyway, your test is redundant; just use the boolean value returned by contains():

<xsl:when test="contains(Story, '&#x0a;')">
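
The same applies to the negated test in the question: not() expresses it directly, with no comparison against a string. A minimal sketch, assuming the Story elements from the sample input above; the <first> wrapper element is hypothetical and only there to make the output visible.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="utf-8" />

  <xsl:template match="Story">
    <!-- <first> is a hypothetical wrapper element for illustration -->
    <first>
      <xsl:choose>
        <!-- contains() already yields a boolean; use it as the test -->
        <xsl:when test="contains(., '&#x0a;')">
          <xsl:value-of select="substring-before(., '&#x0a;')" />
        </xsl:when>
        <!-- negate with not() instead of comparing against 'True';
             inside xsl:choose the not() part is already implied by the
             first branch failing, it is spelled out only to show the form -->
        <xsl:when test="not(contains(., '&#x0a;')) and contains(., '&lt;br&gt;')">
          <xsl:value-of select="substring-before(., '&lt;br&gt;')" />
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="substring(., 1, 150)" />
        </xsl:otherwise>
      </xsl:choose>
    </first>
  </xsl:template>

</xsl:stylesheet>
```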
Erlock
And I agree with Tomalak, an XML sample would help...
Erlock
Thanks for the comment. I have changed the call to contains() as suggested. However, as stated before, the main issue is that I cannot reliably extract the first paragraph from the string. Any help with this?