I have inherited a site with a news section that displays a summary of each news article. For whatever reason, the creators decided that displaying the first X characters of the article would be fine. Of course, this very quickly led to summaries like:

<p>What a mighty fine <a href="blah">da
<p>What a mighty fine and warm <a href="htt
<p>His name was &quot;Emil&qu

Which quite obviously screws with the page, especially when the opening tags aren't even closed.

What I'm after is a way to close all open tags within the string being taken. I really really don't want to use regex to do it. I'm sure there's a nice parser that can do it easily, I just can't seem to find it right now.

+1  A: 

Have you taken a look at Tidy?

Example:

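// Return only the contents of <body> rather than a full HTML document.
$options = array("show-body-only" => true);
$tidy = tidy_parse_string("<B>Hello</I> How are <U> you?</B>", $options);
// Repair the markup: close open tags and fix mismatched ones.
tidy_clean_repair($tidy);
echo $tidy;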

Outputs:

<b>Hello</b> How are <u>you?</u> 
seengee
I'll take a look when I'm next at work. Do you happen to know how it'll handle partial tags like @Emil Vikström suggests?
Blair McMillan
At the very least it will close the tags, so `what a nice <a href="http` would become `what a nice <a href="http"></a>`
seengee
+1  A: 

The best thing is probably to find a better algorithm for generating the excerpt, for example by running `strip_tags` before the truncation.

Otherwise, how will you handle errors that are hard to detect programmatically, such as `<p>What a mighty fine and warm <a href="htt` or `<p>His name was &quot;Emil&qu`?
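
A minimal sketch of that idea, assuming the article body is UTF-8 and the `mbstring` extension is available. The function name `make_excerpt` and its parameters are hypothetical, not part of the original site. It strips the tags first, decodes entities so that something like `&quot;` counts as a single character and cannot be cut in half, truncates, and then re-encodes the result for output:

```php
<?php

// Hypothetical helper: builds a plain-text excerpt from an HTML article body.
function make_excerpt(string $html, int $limit = 100): string
{
    // Remove all markup first, so truncation can never cut a tag in half.
    $text = strip_tags($html);

    // Decode entities so that e.g. &quot; becomes one character
    // and cannot be split by the truncation below.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Truncate by characters rather than bytes (assumes mbstring is loaded).
    $text = mb_substr($text, 0, $limit, 'UTF-8');

    // Re-encode special characters so the excerpt is safe to print in HTML.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Prints: His name was &quot;Emil&quot;
echo make_excerpt('<p>His name was &quot;Emil&quot; and more</p>', 19);
```

The trade-off is that the excerpt loses all formatting and links, which may or may not be acceptable for a summary.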

Emil Vikström
That's a perfect example of exactly my point - I'll even add it to my question. As for fixing the news section, that'll probably be quite unlikely. I'll be able to justify the cost of fixing the summary, but I doubt I'll be able to justify re-writing a large portion of the news section.
Blair McMillan
Unfortunately I couldn't use the Tidy options listed (which I would have preferred) because it wasn't installed on the server and for portability reasons I couldn't install it. So I had to go with stripping all of the tags out. Not ideal, but good enough.
Blair McMillan
+1  A: 

I would install the PHP bindings for Tidy. You can then use them to clean up an HTML fragment with the following code:

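<?php

// A truncated fragment with unclosed <p> and <a> tags.
$fragment = '<p>What a mighty fine <a href="blah">da';

$tidy = new tidy();

// 'show-body-only' returns just the repaired fragment, not a full document.
$tidy->parseString($fragment, array('show-body-only' => true), 'utf8');
// Close the open tags and fix any other markup errors.
$tidy->cleanRepair();

echo $tidy;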
lonesomeday
I'll take a look when I'm next at work. Do you happen to know how it'll handle partial tags like @Emil Vikström suggests?
Blair McMillan
Nothing pretty -- `<p>What a mighty fine and warm <a href="htt"></a></p>`. However, you could then call `strip_tags` on the output to get something nicer, like `What a mighty fine and warm`.
lonesomeday