ansaurus

Question

Answer 1

+2 A:

Try this, although admittedly the translate call's a bit ugly:

<xsl:template match="field">
  <xsl:value-of select="string-length(translate(normalize-space(.),'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',''))+1" />
</xsl:template>

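This of course requires that the string in the translate call includes every character that could appear in the field, other than spaces. It works by first calling normalize-space(.), which takes the text content of the field, trims leading and trailing whitespace and collapses each run of whitespace to a single space. The translate call then removes every listed character, leaving only the spaces; the expression counts the length of the resulting string and adds one. It does mean if you have Mytext test this will count as 2, as it will consider Mytext to be one word. Note also that an empty field is reported as 1 word, since the expression always adds one.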
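
If listing every possible character is a problem, one workaround is the nested translate idiom. This is a sketch, not part of the original answer, and the variable names `norm` and `spaces` are only illustrative. The inner translate deletes the spaces, which yields every non-space character that actually occurs; the outer translate then deletes exactly those characters, so only the spaces remain.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <xsl:template match="field">
    <!-- text content with whitespace trimmed and collapsed -->
    <xsl:variable name="norm" select="normalize-space(.)" />
    <!-- inner translate: $norm without spaces = all non-space characters present;
         outer translate: delete those characters, leaving only the spaces -->
    <xsl:variable name="spaces" select="translate($norm, translate($norm, ' ', ''), '')" />
    <!-- words = spaces + 1, but add nothing when the field is empty -->
    <xsl:value-of select="string-length($spaces) + number(string-length($norm) &gt; 0)" />
  </xsl:template>
</xsl:stylesheet>
```

For `<field>How we adapt</field>` this gives 3, and for an empty field it gives 0. It still treats anything between two spaces as a word, so it shares the other limitations of the one-liner above.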

If you need a more robust solution, it's a little more convoluted:

<xsl:template match="field">
  <xsl:call-template name="countwords">
    <xsl:with-param name="text" select="normalize-space(.)" />
  </xsl:call-template>
</xsl:template>

<xsl:template name="countwords">
  <xsl:param name="count" select="0" />
  <xsl:param name="text" />
  <xsl:choose>
    <xsl:when test="contains($text,' ')">
      <xsl:call-template name="countwords">
        <xsl:with-param name="count" select="$count + 1" />
        <xsl:with-param name="text" select="substring-after($text,' ')" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise><xsl:value-of select="$count + 1" /></xsl:otherwise>
  </xsl:choose>
</xsl:template>

This passes the result of normalize-space(.) into a recursive named template that calls itself when there's a space in $text, incrementing its count parameter, and chopping off the first word each time using the substring-after($text,' ') call. If there's no space, then it treats $text as a single word, and just returns $count + 1 (+1 for the current word).

Bear in mind that this will include ALL text content within the field, including text inside inner elements.
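
As an illustration of that last point, take a field with an inner element. The input below is made up for this example and is not from the question:

```xml
<field>Click <a href="#">here to</a> sign up</field>
```

The string value of the field is "Click here to sign up", so the recursive template reports 5: the words inside the a element are counted, while the tag itself and its attribute are not.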

EDIT: Note to self: read the question properly, just noticed you needed more than just a word count. That's significantly more complicated to do if you want to include any xml tags, but a slight modification of the above is all it takes to spit out each word rather than simply count them:

<xsl:template name="countwords">
  <xsl:param name="count" select="0" />
  <xsl:param name="text" />
  <xsl:choose>
    <xsl:when test="$count = 30" />
    <xsl:when test="contains($text,' ')">
      <xsl:if test="$count != 0"><xsl:text>&#32;</xsl:text></xsl:if>
      <xsl:value-of select="substring-before($text,' ')" />
      <xsl:call-template name="countwords">
        <xsl:with-param name="count" select="$count + 1" />
        <xsl:with-param name="text" select="substring-after($text,' ')" />
      </xsl:call-template>
    </xsl:when>
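    <xsl:otherwise>
      <!-- the final word also needs its separating space, unless it is the only word -->
      <xsl:if test="$count != 0"><xsl:text>&#32;</xsl:text></xsl:if>
      <xsl:value-of select="$text" />
    </xsl:otherwise>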
  </xsl:choose>
</xsl:template>

There's an extra <xsl:when> clause to simply stop recursing when count hits 30, and the recursive clause outputs the text, after adding a space at the beginning if it wasn't the first word.
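
If the limit should not be hard-coded, the same idea can be written with the limit as a parameter. This is only a sketch: the template name `first-words` and the `limit` parameter are hypothetical names, and `$text` is assumed to have been passed through normalize-space() already.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <xsl:template match="field">
    <xsl:call-template name="first-words">
      <xsl:with-param name="text" select="normalize-space(.)" />
      <xsl:with-param name="limit" select="30" />
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="first-words">
    <xsl:param name="text" />
    <xsl:param name="limit" select="30" />
    <xsl:param name="count" select="0" />
    <!-- stop once the limit is reached or the text is used up -->
    <xsl:if test="$count &lt; $limit and $text != ''">
      <!-- every word except the first is preceded by a single space -->
      <xsl:if test="$count != 0"><xsl:text>&#32;</xsl:text></xsl:if>
      <xsl:choose>
        <xsl:when test="contains($text, ' ')">
          <!-- output the first word, then recurse on the remainder -->
          <xsl:value-of select="substring-before($text, ' ')" />
          <xsl:call-template name="first-words">
            <xsl:with-param name="text" select="substring-after($text, ' ')" />
            <xsl:with-param name="limit" select="$limit" />
            <xsl:with-param name="count" select="$count + 1" />
          </xsl:call-template>
        </xsl:when>
        <!-- no space left: $text is the last word -->
        <xsl:otherwise><xsl:value-of select="$text" /></xsl:otherwise>
      </xsl:choose>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Handling the separator before the choose means the last word gets its leading space too, and an empty field produces no output at all.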

EDIT: Ok, here's a solution that keeps the escaped XML content:

<xsl:template match="field">
  <xsl:call-template name="countwords">
    <xsl:with-param name="text" select="." />
  </xsl:call-template>
</xsl:template>

<xsl:template name="countwords">
  <xsl:param name="count" select="0" />
  <xsl:param name="text" />
  <xsl:choose>
    <xsl:when test="starts-with($text, '&lt;')">
      <xsl:value-of select="concat(substring-before($text,'&gt;'),'&gt;')" />
      <xsl:call-template name="countwords">
        <xsl:with-param name="count">
          <xsl:choose>
            <xsl:when test="starts-with(substring-after($text,'&gt;'),' ')"><xsl:value-of select="$count + 1" /></xsl:when>
            <xsl:otherwise><xsl:value-of select="$count" /></xsl:otherwise>
          </xsl:choose>
        </xsl:with-param>
        <xsl:with-param name="text" select="substring-after($text,'&gt;')" />
      </xsl:call-template>
    </xsl:when>
    <xsl:when test="(contains($text, '&lt;') and contains($text, ' ') and string-length(substring-before($text,' ')) &lt; string-length(substring-before($text,'&lt;'))) or (contains($text,' ') and not(contains($text,'&lt;')))">
      <xsl:choose>
        <xsl:when test="$count &lt; 29"><xsl:value-of select="concat(substring-before($text, ' '),'&#32;')" /></xsl:when>
        <xsl:when test="$count = 29"><xsl:value-of select="substring-before($text, ' ')" /></xsl:when>
      </xsl:choose>
      <xsl:call-template name="countwords">
        <xsl:with-param name="count">
          <xsl:choose>
            <xsl:when test="normalize-space(substring-before($text, ' ')) = ''"><xsl:value-of select="$count" /></xsl:when>
            <xsl:otherwise><xsl:value-of select="$count + 1" /></xsl:otherwise>
          </xsl:choose>
        </xsl:with-param>
        <xsl:with-param name="text" select="substring-after($text,' ')" />
      </xsl:call-template>
    </xsl:when>
    <xsl:when test="(contains($text, '&lt;') and contains($text, ' ') and string-length(substring-before($text,' ')) &gt; string-length(substring-before($text,'&lt;'))) or contains($text,'&lt;')">
      <xsl:if test="$count &lt; 30">
        <xsl:value-of select="substring-before($text, '&lt;')" />
      </xsl:if>
      <xsl:call-template name="countwords">
        <xsl:with-param name="count" select="$count" />
        <xsl:with-param name="text" select="concat('&lt;',substring-after($text,'&lt;'))" />
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:if test="$count &lt; 30">
        <xsl:value-of select="$text" />
      </xsl:if>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

If you need any of it explained better, let me know, I'd rather not go into detail unless you need it!

Flynn1179 2010-08-20 11:05:55

The major problem with the first proposed solution is that it will only recognize words in the Latin alphabet and non-words, such as "3X2". The major problem with the second proposed solution is that it will count strings such as "word,word", "word:word", "word.word", "word;word", etc. as single words.

Dimitre Novatchev 2010-08-20 12:55:49

Another issue is that strings like "---------------", "_______________", "===========", etc. will be counted as words.

Dimitre Novatchev 2010-08-20 13:16:30

I believe I quite clearly made that point when I said 'This of course requires that the string in the translate call includes all characters that could appear in the field'.

Flynn1179 2010-08-20 13:29:08

Unfortunately these solutions do not account for spaces within the HTML: How We Adapt to Change will Determine Our Success Simplify reimbursement issues with a resource that guides users on their path to choose the right plan for them. Click here to find out more information and sign up for updates

Randy 2010-08-20 13:31:09

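As far as counting things like "word,word" as two separate words, you've got two choices: either replace `normalize-space(.)` with `normalize-space(translate(., ',:;.', '    '))`, or add a separate `<xsl:when>` clause in the template for each separator. The third argument of translate needs one space per separator character (four here), because translate maps characters by position and deletes any character that has no counterpart in the third argument. To be honest though, if you want this level of parsing, I wouldn't recommend doing it with xpath/xslt.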
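
A minimal sketch of the first option, reusing the counting version of the countwords template from the answer. The choice of separators is just the four listed above; extend both strings together if more are needed.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <xsl:template match="field">
    <xsl:call-template name="countwords">
      <!-- four separators in, four spaces out: translate() maps by position -->
      <xsl:with-param name="text"
                      select="normalize-space(translate(., ',:;.', '    '))" />
    </xsl:call-template>
  </xsl:template>

  <!-- counting template as given in the answer -->
  <xsl:template name="countwords">
    <xsl:param name="count" select="0" />
    <xsl:param name="text" />
    <xsl:choose>
      <xsl:when test="contains($text,' ')">
        <xsl:call-template name="countwords">
          <xsl:with-param name="count" select="$count + 1" />
          <xsl:with-param name="text" select="substring-after($text,' ')" />
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise><xsl:value-of select="$count + 1" /></xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

With `<field>word,word word.word</field>` this counts 4 words instead of 2. A trailing full stop becomes a trailing space, which normalize-space then trims, so sentence endings do not add phantom words.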

Flynn1179 2010-08-20 13:34:22

@Randy: This solution disregards the html markup altogether; isn't that what you meant when you said the HTML should not count as a word? At what point in the above example would you expect a solution to truncate?

Flynn1179 2010-08-20 13:37:41

How We Adapt to Change will Determine Our Success Simplify reimbursement issues with a resource that guides users on their path to choose the right plan for them. Click here

Randy 2010-08-20 13:45:12

It would be great if it could also include any closing html tags, so in the above example - How We Adapt to Change will Determine Our Success Simplify reimbursement issues with a resource that guides users on their path to choose the right plan for them. Click here

Randy 2010-08-20 13:48:51

Ah, ok.. my solution will just give you 'How We Adapt to Change will Determine Our Success Simplify reimbursement issues with a resource that guides users on their path to choose the right plan for them. Click here' as an output. It's possible to adapt that template to 'skip over' the html stuff, but I would need to know if your html is escaped as textual content of an xml element or not though.

Flynn1179 2010-08-20 13:49:38

No, it's not escaped, How We Adapt to Change will Determine Our Success Simplify reimbursement issues with a resource that guides users on their path to choose the right plan for them. Click here to find out more information and sign up for updates

Randy 2010-08-20 13:54:29

@Flynn1179: You just gradually realize how far you are even from complete understanding of the problem Randy has defined -- you still haven't realized some significant problems -- you still don't see them at all. It would be good to consider deleting your answer.

Dimitre Novatchev 2010-08-20 14:36:40

As it was, perhaps it wasn't an adequate solution, but that's not a sufficient reason to delete it; the ability to do a word count may be useful to other readers. However, now that the question's been clarified a little, I've added a solution which seems to work as intended.

Flynn1179 2010-08-20 14:39:47

@Flynn1179: your solution is very far from what was asked for. With this XML document: `<field> <html> Thisis it. </html> </field>` it produces: ` Thisis it.` but the wanted result (for limit of two words) is: `<field> <html> Thisis </html> </field>` So there are at least two problems: 1. Not counting words correctly; 2. Losing the markup. So, once again, your answer so far isn't what was meant and wanted -- please, consider providing a relevant answer, or deleting this one.

Dimitre Novatchev 2010-08-20 16:38:58

@Dimitre: That's not what was asked for. Read the comments again, the html is encoded as text with <, they're not xml elements.

Flynn1179 2010-08-20 17:21:10

It appears to be working how I need it. YOU ARE AWESOME!!!

Randy 2010-08-20 17:38:12

NP, but to be honest, I still think xslt/xpath isn't the best way of doing this, it's really not designed for text processing like this, unless of course this is part of a larger process.

Flynn1179 2010-08-20 17:49:57

@Flynn1179: Now that I see Dimitre's comments here, I really think this would need some sort of parser, because the pseudo-element names will affect the results. Otherwise, all these assumptions should be pointed out: start and end tag marks and entity references (particularly `<`) are encoded, and block-style elements have an explicit space (inside, or in the next sibling) between them and the next word. Also, if it must keep the encoded HTML, the result may not be valid.

Alejandro 2010-08-20 18:05:46

@Flynn1179: Even if the problem is defined as you say ???, your code still produces wrong results. With this input: `<field>MondayTuesday</field>` the result is: `MondayTuesday`. I guess the right result should be: `Monday` MondayTuesday

Dimitre Novatchev 2010-08-20 18:37:25

@Randy: I fully agree with @Flynn1179 that parsing unparsed HTML is not a task that is appropriate for XSLT (it *can* be done, but neither I nor anybody in their right mind would do this with XSLT).

Dimitre Novatchev 2010-08-20 18:39:53

I totally agree it's far from a perfect solution, but it certainly produces the wanted output from the given input. @Dimitre, in that example it's doing exactly as it should: leaving the input unchanged if it has fewer than 30 words in it. Yeah, it'll probably be mis-counting words slightly in some cases, but a more 'perfect' solution is probably overkill for this purpose. @Randy only wanted to trim the field down to 30 words, not stick extra spaces where words are only separated by markup.

Flynn1179 2010-08-20 19:05:12

Incidentally, there is one fairly important flaw with this solution (that I'm surprised nobody else spotted): It leaves all remaining markup in after the 30 words, not just closing tags. For example, if you happen to have something as the 40th word or so, you'll end up with in the output; not sure how much of an issue this is, but the text should still be correct.

Flynn1179 2010-08-20 19:07:08

I would love to be able to use something besides XSLT. However, our CMS is based upon using XML and XSL. Fortunately, the code should be pretty clean. Thank you very much for your help.

Randy 2010-08-20 19:09:57

@Flynn1179: No, I edited your code to get only one word. It is still outputting the two words -- not understanding that they are actually two words -- not one.

Dimitre Novatchev 2010-08-20 19:25:14

@Randy: If you must start with the unparsed XHTML text, then I would first unescape it and parse it, then I would apply XSLT on this parsed XML document. Any other solution with XSLT is too expensive and difficult to maintain -- it may even be difficult to prove that such a solution really works.

Dimitre Novatchev 2010-08-20 19:27:38

@Flynn1179 - looks like the empty tags are a problem. Do you have an idea of how to resolve this?

Randy 2010-08-23 18:57:15

Not realistically without doing something horrifically complicated. If you can live with tags not being closed if they're still open after 30 words it's fairly easy to simply have it stop outputting ANYTHING at that point, but it would pretty much need to be able to parse XML otherwise. I'd strongly recommend using something other than XSLT to do this, or look into adding an extension function that your xslt can use, like this for C#: http://www.csharpfriends.com/Articles/getArticle.aspx?articleID=64

Flynn1179 2010-08-24 06:10:01


XSLT 1.0 word count with HTML
