I'm looking for a way to split a string containing HTML into two halves. Requirements:

  • Split the string at a given number of characters
  • Must not split in the middle of a word
  • Must not count HTML markup (tags) when calculating where to split the string

For example, take the following string:

<p>This is a test string that contains <strong>HTML</strong> tags and text content. This string needs to be split without slicing through the <em>middle</em> of a word and must preserve the validity of the HTML, i.e. not split in the middle of a tag, and make sure closing tags are respected correctly.</p>

Say I want to split at character position 39, which falls in the middle of the word "HTML" (not counting the HTML tags). I would want the function to split the string into the following two parts:

<p>This is a test string that contains <strong>HTML</strong></p>

and

<p>tags and text content. This string needs to be split without slicing through the <em>middle</em> of a word and must preserve the validity of the HTML, i.e. not split in the middle of a tag, and make sure closing tags are respected correctly.</p>

Notice that in the two results above the validity of the HTML is preserved: the first half ends with the closing </strong> and </p> tags, and an opening <p> tag was added to the second half because its closing tag sits at the end of the string.
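To make the offset concrete, here is a small check of where position 39 lands once the markup is ignored. It uses a shortened version of the example string, and the variable names are only illustrative:

```php
<?php
// Shortened version of the example string from above.
$html = '<p>This is a test string that contains <strong>HTML</strong> tags and text content.</p>';

// strip_tags() removes the markup, leaving only the characters that should be counted.
$text = strip_tags($html);

// The first 39 text characters end inside the word "HTML",
// so the split point has to move to the end of that word.
$firstChars = substr($text, 0, 39);
$word = substr($text, 36, 4);

echo $firstChars, "\n"; // This is a test string that contains HTM
echo $word, "\n";       // HTML
```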

I found this function on StackOverflow that truncates a string to a given number of text characters while preserving the HTML, but it only goes halfway to what I need, since I need to split the string into two halves.

function printTruncated($maxLength, $html)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

    while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength)
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);

        if ($tag[0] == '&')
        {
            // Handle the entity.
            print($tag);
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.

                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.

                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                print($tag);
            }
            else
            {
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}
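
One possible direction, as a rough sketch rather than a finished solution: reuse the same tokenizing idea as printTruncated, but instead of printing, find the byte offset of the split and remember which tags are open at that point. The function name splitHtml, the parameter $splitAt and the $voidTags list are my own assumptions. The sketch assumes well-formed, single-byte HTML, treats each entity as one character, and moves the split forward to the next whitespace character.

```php
<?php
// Sketch: split $html after roughly $splitAt text characters (markup not counted).
// Returns [firstHalf, secondHalf]. Hypothetical helper, not a library function.
function splitHtml(string $html, int $splitAt): array
{
    // Same idea as printTruncated: match a tag or an entity.
    $pattern = '{</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>|&#?[a-zA-Z0-9]+;}';
    $voidTags = ['br', 'hr', 'img', 'input', 'wbr']; // assumed list, extend as needed
    $length = strlen($html);
    $position = 0;    // byte offset into $html
    $textCount = 0;   // text characters seen so far
    $openTags = [];   // stack of [tagName, fullOpeningTag]
    $cut = $length;   // default: everything ends up in the first half

    while ($position < $length) {
        $found = preg_match($pattern, $html, $match, PREG_OFFSET_CAPTURE, $position);
        $tokenPos = $found ? $match[0][1] : $length;

        // Text between the current position and the next tag/entity.
        $text = substr($html, $position, $tokenPos - $position);
        $textLen = strlen($text);
        if ($textCount + $textLen >= $splitAt) {
            // Target reached: look for the next whitespace so no word is cut.
            $start = max(0, $splitAt - $textCount);
            $ws = $start + strcspn($text, " \t\r\n", $start);
            if ($ws < $textLen) {
                $cut = $position + $ws;
                break;
            }
        }
        $textCount += $textLen;
        if (!$found) {
            break;
        }

        $token = $match[0][0];
        if ($token[0] === '&') {
            $textCount++; // an entity counts as one character
        } elseif ($token[1] === '/') {
            // Closing tag: pop it if it matches the innermost open tag.
            if ($openTags && end($openTags)[0] === strtolower($match[1][0])) {
                array_pop($openTags);
            }
        } elseif (substr($token, -2) !== '/>'
            && !in_array(strtolower($match[1][0]), $voidTags, true)) {
            // Opening tag: keep the full tag so attributes survive in the second half.
            $openTags[] = [strtolower($match[1][0]), $token];
        }
        $position = $tokenPos + strlen($token);
    }

    $first = substr($html, 0, $cut);
    $second = ltrim((string) substr($html, $cut));

    // Close the still-open tags in the first half, innermost first.
    foreach (array_reverse($openTags) as [$name]) {
        $first .= "</$name>";
    }
    // Reopen them, outermost first, at the start of the second half.
    $prefix = '';
    foreach ($openTags as [, $openingTag]) {
        $prefix .= $openingTag;
    }

    return [$first, $second === '' ? '' : $prefix . $second];
}
```

Calling splitHtml($html, 39) on the example string should give the two parts shown in the question. Things this sketch does not handle: multibyte text (it counts bytes, like printTruncated), HTML comments, and empty elements left behind when the split lands just before a closing tag.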