I want to truncate some text (loaded from a database or text file), but it contains HTML, so the tags count towards the length and less visible text is returned. The cut can also leave tags unclosed or chopped in half (so Tidy may not work properly, and there is still less content). How can I truncate based on the visible text only (probably stopping when I reach a table, as that could cause more complex issues)?

substr("Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.",0,26)."..."

Would result in:

Hello, my <strong>name</st...

What I would want is:

Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m...

How can I do this?

While my question is about PHP, it would be good to know how to do it in C# as well. Either should be OK, as I think I would be able to port the method over (unless it relies on a built-in method).

Also note that I have included an HTML entity &acute; - which would have to be considered as a single character (rather than 7 characters as in this example).

strip_tags is a fallback, but I would lose formatting and links and it would still have the problem with HTML entities.
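
For reference, the strip_tags fallback mentioned above could be sketched roughly like this. The function name truncatePlainText is made up for illustration, and the sketch assumes UTF-8 input and the mbstring extension. Decoding entities before cutting makes &acute; count as one character, but all formatting and links are lost.

```php
<?php
// Hypothetical helper: plain-text fallback that drops all markup.
// Assumes UTF-8 input and the mbstring extension.
function truncatePlainText(string $html, int $maxLength): string
{
    // Remove tags, then turn entities such as &acute; into real characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML401, 'UTF-8');

    if (mb_strlen($text, 'UTF-8') <= $maxLength) {
        // Short enough: escape again for output and return as-is.
        return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    }

    // Cut by characters (not bytes), escape again, and add the ellipsis.
    $cut = mb_substr($text, 0, $maxLength, 'UTF-8');
    return htmlspecialchars($cut, ENT_QUOTES, 'UTF-8') . '...';
}

echo truncatePlainText(
    'Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.',
    26
), "\n";
```

The entity is counted correctly here, but the <strong> and <em> markup is gone, which is why this is only a fallback.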

A: 

This is very difficult to do without a validator and a parser. Imagine you have:

<div id='x'>
    <div id='y'>
        <h1>Heading</h1>
        500 
        lines 
        of 
        html
        ...
        etc
        ...
    </div>
</div>

How do you plan to truncate that and end up with valid HTML?

After a brief search, I found this link which could help.

Antony Carthy
+7  A: 

Assuming you are using XHTML, it's not too hard to parse the HTML and make sure tags are handled properly. You simply need to track which tags have been opened so far, and make sure to close them again "on your way out".

<?php
header('Content-type: text/plain');

function printTruncated($maxLength, $html)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

    while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength)
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);

        if ($tag[0] == '&')
        {
            // Handle the entity.
            print($tag);
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.

                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.

                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                print($tag);
            }
            else
            {
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}


printTruncated(10, '<b>&lt;Hello&gt;</b> <img src="world.png" alt="" /> world!'); print("\n");

printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); print("\n");

printTruncated(10, '<em><b>&lt;Hello&gt;</b>&#x20;world!</em>'); print("\n");

Edit: Updated to handle entities as well.

Søren Løvborg
That looks like it might work... although what about HTML entities?
Sam
The code should handle entities correctly now.
Søren Løvborg
+1  A: 

The following is a simple state-machine parser which handles your test case successfully. It fails on nested tags, though, as it doesn't track the tags themselves. It also chokes on entities within HTML tags (e.g. in an href attribute of an <a> tag). So it cannot be considered a 100% solution to this problem, but because it's easy to understand, it could be the basis for a more advanced function.

function substr_html($string, $length)
{
    $count = 0;
    /*
     * $state = 0 - normal text
     * $state = 1 - in HTML tag
     * $state = 2 - in HTML entity
     */
    $state = 0;    
    for ($i = 0; $i < strlen($string); $i++) {
        $char = $string[$i];
        if ($char == '<') {
            $state = 1;
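        } else if ($char == '&') {
            $state = 2;
        } else if ($state === 2 && $char == ';') {
            // Count the entity as one character once it is complete, so the
            // cut never lands inside it; a ';' in plain text is counted below.
            $state = 0;
            $count++;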
        } else if ($char == '>') {
            $state = 0;
        } else if ($state === 0) {
            $count++;
        }

        if ($count === $length) {
            return substr($string, 0, $i + 1);
        }
    }
    return $string;
}
Stefan Gehrig
+3  A: 

100% accurate, but pretty difficult approach:

  1. Iterate characters using DOM
  2. Use DOM methods to remove remaining elements
  3. Serialize the DOM
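
A rough sketch of the DOM approach above, using PHP's DOM extension. The function names truncateWithDom and truncateNode are made up for illustration. It assumes UTF-8 input, the mbstring extension, and that the fragment is acceptable to libxml's HTML parser; the wrapper div and the XML encoding prefix are workarounds chosen for this sketch, not part of the approach itself. It walks text nodes rather than single characters, which amounts to the same count.

```php
<?php
// Hypothetical helpers; assume the DOM and mbstring extensions and UTF-8 input.

// Walk the tree in document order, spending the character budget on text
// nodes and removing every node that comes after the budget is used up.
function truncateNode(DOMNode $node, int &$remaining): void
{
    // Copy the child list first: the live list changes while nodes are removed.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);
        } elseif ($child instanceof DOMText) {
            // Entities are already decoded in the DOM, so each counts as one character.
            $length = mb_strlen($child->data, 'UTF-8');
            if ($length > $remaining) {
                $child->data = mb_substr($child->data, 0, $remaining, 'UTF-8') . '...';
                $remaining = 0;
            } else {
                $remaining -= $length;
            }
        } else {
            truncateNode($child, $remaining);
        }
    }
}

function truncateWithDom(string $html, int $maxLength): string
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    // The wrapper div gives the fragment one known root element;
    // the prefix hints to libxml that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    $root = $doc->getElementsByTagName('div')->item(0);

    $remaining = $maxLength;
    truncateNode($root, $remaining);

    // Serialize the children of the wrapper, leaving the wrapper itself out.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo truncateWithDom('Hello <b>big</b> world', 8), "\n";
```

Because the serializer closes whatever is still open, no separate tag stack is needed; the price is that the output is the parser's normalized version of the markup rather than the original bytes.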

Easy brute-force approach:

  1. Split string into tags (not elements) and text fragments using preg_split('/(<tag>)/') with PREG_SPLIT_DELIM_CAPTURE.
  2. Measure text length you want (it'll be every second element from split, you might use html_entity_decode() to help measure accurately)
  3. Cut the string (trim &[^\s;]+$ at the end to get rid of possibly chopped entity)
  4. Fix it with HTML Tidy
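
Steps 1 to 3 of the brute-force approach might look roughly like this. The function name cutHtmlText and the tag pattern <[^>]*> are assumptions made for this sketch. Instead of measuring with html_entity_decode() and trimming a chopped entity afterwards, it counts each entity as one unit with a regex, so the cut never lands inside one. The result can still contain unclosed tags and is meant to be passed through Tidy (step 4) afterwards. UTF-8 input is assumed.

```php
<?php
// Hypothetical helper sketching steps 1-3; the output may contain unclosed
// tags and still needs repairing (step 4). Assumes UTF-8 input.
function cutHtmlText(string $html, int $maxLength): string
{
    // Step 1: even indexes hold text fragments, odd indexes hold tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    $remaining = $maxLength;
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Tags are copied through without being counted.
            $out .= $part;
            continue;
        }

        // Step 2: split the text into units; an entity or a single
        // character each counts as one.
        preg_match_all('/&#?[a-zA-Z0-9]+;|./su', $part, $matches);
        $units = $matches[0];

        if (count($units) <= $remaining) {
            $out .= $part;
            $remaining -= count($units);
            continue;
        }

        // Step 3: keep only the units that still fit, then stop.
        $out .= implode('', array_slice($units, 0, $remaining));
        return $out . '...';
    }
    return $out;
}

echo cutHtmlText(
    'Hello, my <strong>name</strong> is <em>Sam</em>. I&acute;m a web developer.',
    26
), "\n";
```

Tags that follow the last counted character are still copied until the next text fragment, so closing tags directly after the cut point survive; anything left open is what Tidy is for.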
porneL
I upvoted for the accurate approach, but would downvote for the brute-force method.
Kris
Is the brute-force method that bad? The first part of it can be made quite accurate (if you're good with regexps), and with Tidy you'll support optional HTML start tags properly (<table><tr><td></tbody></table> is valid HTML4 :), which a naive stack-based solution wouldn't.
porneL
A: 

Thank you so very much. Last method solved the problem. Thanks again.

Stazh