I need a solution to the following situation, please.

I have a function that returns a string like this:

<p>This is some text and here is a <strong>bold text then the post stop here....</p>

Because the function returns a teaser (summary) of the text, it cuts the text off after a certain number of words. In this case the strong tag is left unclosed, although the whole string is still wrapped in a paragraph.

Is it possible to convert the above result/output to the following:

<p>This is some text and here is a <strong>bold text then the post stop here....</strong></p>

I do not know where to begin. I found a function on the web that does this with a regex, but it puts the closing tag after the end of the string. The result won't validate, because I want all opening and closing tags to stay inside the paragraph tags. The function I found produces this, which is also wrong:

<p>This is some text and here is a <strong>bold text then the post stop here....</p></strong>

Please share your ideas, I am trying everything!

Note that the unclosed tag can be strong, italic, anything. That's why I cannot hard-code the closing tag in the function. Is there any pattern that can do this for me?

A: 

Using a regular expression isn't an ideal approach for this. You should use an HTML parser instead to create a valid document object model.
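
As a rough illustration of the parser route in PHP, here is a minimal sketch using the bundled DOM extension. The helper name closeTagsWithDom and the wrapper div are assumptions made for this example, not something from the thread. It also assumes the teaser is plain ASCII or entity-encoded, since loadHTML() does not treat its input as UTF-8 by default.

```php
<?php
// Hypothetical helper (not from the thread): let DOMDocument repair the markup.
function closeTagsWithDom(string $html): string
{
    $doc = new DOMDocument();

    // The teaser is known to be malformed, so collect libxml's parse
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    // Wrap the fragment in a <div> so it can be found again after parsing.
    $doc->loadHTML('<div>' . $html . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    // Serialize only the wrapper's children, which drops the <html>,
    // <body> and <div> elements added around the fragment.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$teaser = '<p>This is some text and here is a <strong>bold text then the post stop here....</p>';
echo closeTagsWithDom($teaser), PHP_EOL;
```

The parser closes the open strong element when it reaches the closing p tag, so the repaired markup keeps the nesting the question asks for.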

As a second option, depending on what you want, you could use a regex to remove all HTML tags from your string before you put it in the <p> tag.
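
A minimal sketch of that second option follows. Note that it swaps the regex for PHP's built-in strip_tags(), which does the same job without a hand-written pattern. The function name plainTextTeaser is made up for this example.

```php
<?php
// Hypothetical helper: drop all markup from the teaser, then wrap the
// remaining text in a single paragraph.
function plainTextTeaser(string $html): string
{
    // strip_tags() removes the tags and keeps the text between them.
    return '<p>' . trim(strip_tags($html)) . '</p>';
}

$teaser = '<p>This is some text and here is a <strong>bold text then the post stop here....</p>';
echo plainTextTeaser($teaser), PHP_EOL;
```

The trade-off is that all formatting (bold, italic, links) is lost in the teaser, which may or may not be acceptable.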

JoshD
Why would you suggest an HTML parser in Python when there are PHP alternatives? The question **is** tagged PHP.
Russell Dias
Yeah, I completely misread that tag...
JoshD
+2  A: 

There are numerous other variables that need to be addressed to give a full solution, but are not covered by your question.

However, I would suggest using something like HTML Tidy and in particular the repairFile or repairString methods.

Russell Dias
The problem is that the string comes in as HTML-formatted input, and sometimes the teaser ends where one or more tags are open but not yet closed, because the closing tags are further on in the full article. I just want to close those open tags right before the teaser ends. I thought regex could do that.
Ahmad Fouad
You can use the HTML Tidy option. Regex is frowned upon when accessing HTML elements simply because parsing HTML is too irregular and encompassing all of its idiosyncrasies in regex is a monstrous task. I would recommend you at least try the HTML Tidy option...
Russell Dias
This is the proper way to go.
Flavius
A: 

Here is a function I've used before, which works pretty well:

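function closetags($html) {
    // Collect the names of all opening tags, skipping meta, img, br, hr,
    // input and tags that end in a slash (self-closed).
    preg_match_all('#<(?!meta|img|br|hr|input\b)\b([a-z]+)(?: .*)?(?<![/|/ ])>#iU', $html, $result);
    $openedtags = $result[1];
    // Collect the names of all closing tags.
    preg_match_all('#</([a-z]+)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    // Same number of opening and closing tags: assume the markup is balanced.
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    // Walk the opening tags from the innermost one outwards.
    $openedtags = array_reverse($openedtags);
    for ($i=0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            // No closing tag left for this name: append one at the end of the string.
            $html .= '</'.$openedtags[$i].'>';
        } else {
            // Use up one matching closing tag so it is not counted twice.
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;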
} 
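
Note that closetags() always appends the missing closing tags at the very end of the string, so it gives correct nesting only when the string is not already wrapped in a closed outer tag. As a sketch of how the closing tags could be placed inside the paragraph instead, the variant below walks the markup with a stack. The name balanceTags is made up for this example, and it assumes simple markup: no ">" inside attribute values, and no tag cut in half at the end of the teaser.

```php
<?php
// Hypothetical helper: close open tags at the point where an outer tag closes.
function balanceTags(string $html): string
{
    // Void elements never get a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    // Split into text and tag tokens; the capture group keeps the tags.
    $tokens = preg_split('#(</?[a-z][a-z0-9]*\b[^>]*>)#i', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $stack = []; // names of the elements that are currently open
    $out = '';
    foreach ($tokens as $token) {
        if (!preg_match('#^<(/?)([a-z][a-z0-9]*)\b[^>]*?(/?)>$#i', $token, $m)) {
            $out .= $token; // plain text, copied unchanged
            continue;
        }
        $name = strtolower($m[2]);
        if ($m[1] === '/') {
            // Closing tag: find the innermost open element with this name.
            $idx = array_search($name, array_reverse($stack, true), true);
            if ($idx === false) {
                continue; // stray closing tag, drop it
            }
            // Close everything opened after it, then the element itself.
            while (count($stack) > $idx) {
                $out .= '</' . array_pop($stack) . '>';
            }
        } else {
            $out .= $token;
            // Remember the element unless it is void or self-closed.
            if ($m[3] !== '/' && !in_array($name, $void, true)) {
                $stack[] = $name;
            }
        }
    }
    // Close whatever is still open when the teaser ends, innermost first.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo balanceTags('<p>This is some text and here is a <strong>bold text then the post stop here....</p>'), PHP_EOL;
```

This is still regex-based tokenizing, so it shares the usual limits of treating HTML as text. It only changes where the closing tags are written.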

Personally though, I would not do it using regexp but a library such as Tidy. This would be something like the following:

$str = '<p>This is some text and here is a <strong>bold text then the post stop here....</p>';
$tidy = new Tidy();
$clean = $tidy->repairString($str, array(
    'output-xml' => true,
    'input-xml' => true
));
echo $clean;
alexn
Please explain that downvote. The Tidy method works perfectly.
alexn
You know what, that regex did it. No validation errors or anything.
Ahmad Fouad
Yeah, that regexp works really well and is a good alternative if Tidy is not available.
alexn
Can I ask about Tidy? Is it available on all servers? Some clients still rely on PHP 4. Is that OK, or is it something that can be bundled with the script? :)
Ahmad Fouad