I'm using a PHP function to split text into blocks of at most N characters. Once each block has been "treated" in some way, the blocks are concatenated back together. The problem is that the text can be HTML, and if a split falls inside an open HTML element, the "treatment" gets spoiled. Can someone give a hint about breaking the text only between closed tags?

Requirements:

  • Max block length: N
  • There are NO <body> tags
  • There are NO <HTML> tags
  • There are NO <head> tags

Adding a sample: (max block length = 173)

<div class="myclass">
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer dapibus sagittis lacus quis cursus.
</div>
<div class="anotherclass">
Nulla ligula felis, adipiscing ac varius et, sollicitudin eu lorem. Sed laoreet porttitor est, sit amet vestibulum massa pretium et. In interdum auctor nulla, ac elementum ligula aliquam eget
</div>

In the text above, with a limit of 173 chars, the text would break at "adipiscing"; however, that would break the <div class="anotherclass"> element. In this case, the split should occur at the first closing tag, even though the resulting block is shorter than the max limit.

+1  A: 

The "correct" way would be to parse the HTML and perform the shortening operations on its text nodes. In PHP5 you could use the DOM extension, and specifically DOMDocument::loadHTML().
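As a rough illustration of that idea, here is a minimal sketch that loads the fragment with DOMDocument and only ever cuts between top-level elements. The function name split_html_between_tags, the wrapper <div>, and the greedy packing are assumptions made for this example, not part of the DOM extension. It assumes PHP 7 or later and ASCII/Latin-1 input; a single element longer than N is kept whole, and the serialized markup may differ slightly from the input (for example in whitespace).

```php
<?php
// Hypothetical helper: splits an HTML fragment into blocks of at most
// $maxLen bytes, cutting only between top-level nodes.
function split_html_between_tags(string $html, int $maxLen): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The fragment has no <html>/<body>, so wrap it in one known root element.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $root = $doc->getElementsByTagName('div')->item(0);

    $blocks = [];
    $current = '';
    foreach ($root->childNodes as $node) {
        // Serialize this top-level node (element or text) with closed tags.
        $piece = $doc->saveHTML($node);

        // Start a new block if adding this piece would exceed the limit.
        if ($current !== '' && strlen($current) + strlen($piece) > $maxLen) {
            $blocks[] = $current;
            $current = '';
        }
        // An oversized single node still becomes its own (too long) block.
        $current .= $piece;
    }
    if ($current !== '') {
        $blocks[] = $current;
    }
    return $blocks;
}
```

Each returned block can then be treated and the results joined back with implode(''). Handling an element that is itself longer than N would require descending into its children or text nodes, which this sketch does not attempt.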

Do you have to be sure that the tags are coded correctly to use this tool? What if the tags are broken?
Riccardo
"Unlike loading XML, HTML does not have to be well-formed to load." -- the results may be unexpected but it should be able to at least parse it. Also from the `loadHTML` manual page: "DOMDocument is very good at dealing with imperfect markup, but it throws warnings all over the place when it does."
You
@Riccardo DOM can load HTML even if it is invalid. You won't be able to use getElementById, but everything else should work. If DOM throws warnings about the markup, you can enable custom error handling and clear the errors. See http://kore-nordmann.de/blog/0081_parse_html_extract_data_from_html.html
Gordon
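The error handling mentioned in the comment above could look roughly like this. It is a minimal sketch that uses the libxml functions libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors(); the helper name load_html_quietly is hypothetical, and PHP 7.1 or later is assumed.

```php
<?php
// Hypothetical helper: loads possibly invalid HTML without PHP warnings
// and returns the document together with the collected libxml errors.
function load_html_quietly(string $html): array
{
    // Buffer libxml errors internally; remember the previous setting.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Fetch whatever the parser complained about, then empty the buffer.
    $errors = libxml_get_errors();
    libxml_clear_errors();

    // Restore the previous error-handling mode.
    libxml_use_internal_errors($previous);

    return [$doc, $errors];
}

// Usage: the unclosed <p> and <b> are repaired by the parser.
[$doc, $errors] = load_html_quietly('<div><p>unclosed <b>bold</div>');
```

The $errors array holds LibXMLError objects, so the markup problems can be logged or ignored as needed.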
A: 

Hmmm, I've used some code where I had to split the copy entered through a WYSIWYG editor and wanted to retrieve the first paragraph from it. It's a little dodgy, but it did the trick for me. If you wanted to show only the first "n" characters, you could apply substr to the "intro" var. Hope this starts you off :-|

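function break_html_description_to_chunks($description = null)
{
    $description = (string) $description;

    // Locate the end of the first paragraph; strpos() returns false if there is no </p>.
    $firstParaEnd = strpos($description, "</p>");
    if ($firstParaEnd === false) {
        // No closing </p> found: treat the whole text as the intro.
        return array("intro" => $description, "body" => "");
    }
    $firstParaEnd += strlen("</p>");

    // Everything up to and including the first </p> is the intro; the rest is the body.
    $intro = substr($description, 0, $firstParaEnd);
    $body = (string) substr($description, $firstParaEnd);

    return array("intro" => $intro, "body" => $body);
}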
PHPology
Thanks..........!
Riccardo