Hello,

How would you programmatically abbreviate XHTML to an arbitrary number of words without leaving unclosed or corrupted tags?

E.g. this:

<p>
    Proin tristique dapibus neque. Nam eget purus sit amet leo
    tincidunt accumsan.
</p>
<p>
    Proin semper, orci at mattis blandit, augue justo blandit nulla.
    <span>Quisque ante congue justo</span>, ultrices aliquet, mattis eget,
    hendrerit, <em>justo</em>.
</p>

...abbreviated to 25 words would be:

<p>
    Proin tristique dapibus neque. Nam eget purus sit amet leo
    tincidunt accumsan.
</p>
<p>
    Proin semper, orci at mattis blandit, augue justo blandit nulla.
    <span>Quisque ante congue...</span>
</p>

Thanks,

Nick

+1  A: 

Recurse through the DOM tree, keeping a running word count. When the count reaches your maximum, cut the current text node at that word, append "...", and remove all following siblings of the current node. Then, as the recursion unwinds, remove all following siblings of each of its ancestors.
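A minimal sketch of that recursion, in Python with the standard library's xml.dom.minidom. It assumes the input is a well-formed XHTML fragment with no HTML named entities such as &nbsp; (minidom would reject those), and that a "word" is any run of non-whitespace characters. The name truncate_xhtml and the <root> wrapper are made up for illustration.

```python
import re
from xml.dom import minidom, Node


def truncate_xhtml(xhtml, max_words, ellipsis="..."):
    """Cut an XHTML fragment to max_words words, keeping every tag closed."""
    # Hypothetical <root> wrapper so a fragment with several top-level
    # elements parses as a single well-formed document.
    doc = minidom.parseString("<root>" + xhtml + "</root>")
    remaining = max_words

    def remove_following_siblings(node):
        while node.nextSibling is not None:
            node.parentNode.removeChild(node.nextSibling)

    def walk(node):
        """Return True once the limit has been hit inside this subtree."""
        nonlocal remaining
        if node.nodeType == Node.TEXT_NODE:
            words = list(re.finditer(r"\S+", node.data))
            if len(words) <= remaining:
                remaining -= len(words)
                return False
            # Keep the text up to the last allowed word, then add the ellipsis.
            end = words[remaining - 1].end() if remaining > 0 else 0
            node.data = node.data[:end] + ellipsis
            return True
        for child in list(node.childNodes):
            if walk(child):
                # Unwinding: drop everything after the node on the cut path.
                remove_following_siblings(child)
                return True
        return False

    walk(doc.documentElement)
    # Serialize the children only, so the wrapper never reaches the output.
    return "".join(child.toxml() for child in doc.documentElement.childNodes)


print(truncate_xhtml("<p>one two <em>three four</em> five</p>", 3))
# prints: <p>one two <em>three...</em></p>
```

One edge case to be aware of: if the limit is used up exactly at the end of a text node and more text follows, the next text node is replaced by a bare "...". Whether that is what you want depends on your layout.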

levand
+1  A: 

You need to think of the XHTML as a hierarchy of elements and treat it as such, which is how XML is meant to be handled. Then walk the hierarchy recursively, adding up the number of words as you go. When you hit your limit, throw everything else away.

I work mainly in PHP, and I would use its DOMDocument class to help me do this. You need to find something similar in your chosen language.

To make things clearer, here is the hierarchy for your sample:

- p
    - Proin tristique dapibus neque. Nam eget purus sit amet leo
      tincidunt accumsan.
- p
    - Proin semper, orci at mattis blandit, augue justo blandit nulla.
    - span
          - Quisque ante congue justo
    - , ultrices aliquet, mattis eget, hendrerit, 
    - em
          - justo
    - .

You hit the 25-word limit inside the span element, so you remove all remaining text within the span and add the ellipsis. Everything after the span inside its parent (both text and tags) can be discarded, and so can all subsequent elements.
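To check where the limit falls, here is a rough sketch that walks the text nodes in document order and keeps a running word total. It is Python with xml.dom.minidom standing in for PHP's DOMDocument. The function names and the <root> wrapper are made up for illustration, and words are counted by splitting on whitespace, so a stray "," or "." counts as a word.

```python
from xml.dom import minidom, Node


def running_word_counts(xhtml):
    """List (parent tag, words in text node, running total) in document order."""
    # Hypothetical <root> wrapper, so multi-element fragments parse.
    doc = minidom.parseString("<root>" + xhtml + "</root>")
    rows = []
    total = 0

    def walk(element):
        nonlocal total
        for child in element.childNodes:
            if child.nodeType == Node.TEXT_NODE:
                count = len(child.data.split())
                if count:  # skip whitespace-only nodes between tags
                    total += count
                    rows.append((element.tagName, count, total))
            elif child.nodeType == Node.ELEMENT_NODE:
                walk(child)

    walk(doc.documentElement)
    return rows


def element_at_limit(xhtml, max_words):
    """Return the tag whose text node contains word number max_words, or None."""
    for tag, _count, total in running_word_counts(xhtml):
        if total >= max_words:
            return tag
    return None
```

For the sample in the question the running totals are 12 after the first p, 22 after the opening text of the second p, and 26 after the span, so element_at_limit(sample, 25) comes back as "span".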

As far as I can see, this should always leave you with valid markup: because you are treating the document as a hierarchy and not as plain text, every closing tag that is required will still be there.

Of course if the XHTML you are dealing with is invalid to begin with, don't expect the output to be valid.

Sorry for the poor hierarchy example; I couldn't work out how to nest lists.