Hi,

I wrote this expression to remove all empty tags in the page (including tags that contain only whitespace).

$content =  preg_replace('/<[^\/>]*>([\s]?)*<\/[^>]*>/', '', $content);

It worked a treat until it had to deal with content like this...

 <blockquote>
<p >foo bar</p>
</blockquote>
<p ><a href="image.jpg" rel="lightbox" title=""><img  title="image" src="image.jpg" /></a><br /></p>

and it outputs it as...

<blockquote>
<p >foo bar</p>
<p ><a href="image.jpg" rel="lightbox" title=""><img  title="image" src="image.jpg" /></a><br /></p>

Thus removing the </blockquote>.

I have been scratching my head over this one and can't get it working. Can anyone see an obvious solution, other than listing exactly which tags it should process? I should also say that it is filtering 'the_content' on a WordPress post.

+4  A: 

Regexps and HTML are not a good match: HTML is not a regular language, so a regex cannot reliably pair opening and closing tags, and there is no end of edge cases and gotchas. You'll be better off using an HTML parser such as Simple HTML DOM and inspecting/manipulating the DOM object.
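As a rough illustration of the parser approach, here is a minimal sketch that uses PHP's built-in DOMDocument rather than Simple HTML DOM (that substitution is mine, as it needs no extra library). The function name `remove_empty_tags`, the wrapper id `fragment-root`, and the list of tags to keep are all assumptions for the example, not anything from the question. Note that the parser also normalises the markup on output, so `<br />` comes back as `<br>`.

```php
<?php
// Hypothetical helper: removes elements that have no child elements and no
// text other than whitespace. Not the asker's code, just a sketch.
function remove_empty_tags(string $html): string
{
    // Elements that are allowed to be empty (void elements plus a few
    // embeds). This list is an assumption; adjust it for your content.
    $keep = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr',
             'iframe', 'script', 'textarea', 'canvas', 'video', 'audio', 'object'];

    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $doc = new DOMDocument();
    // Wrap the fragment in a known root element and declare UTF-8.
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body><div id="fragment-root">' . $html . '</div></body></html>'
    );

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="fragment-root"]')->item(0);

    // Repeat until nothing changes: removing an empty child can leave
    // its parent empty as well.
    do {
        $removed = 0;
        $empty = $xpath->query('.//*[not(*)][normalize-space(.) = ""]', $root);
        foreach (iterator_to_array($empty) as $node) {
            if (!in_array($node->nodeName, $keep, true)) {
                $node->parentNode->removeChild($node);
                $removed++;
            }
        }
    } while ($removed > 0);

    // Serialise only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $out;
}
```

Because the tree is edited node by node, a closing tag such as `</blockquote>` cannot be removed on its own. In WordPress this could presumably be attached with `add_filter('the_content', 'remove_empty_tags')`. One caveat: `normalize-space` does not treat a non-breaking space as whitespace, so a tag containing only `&nbsp;` would be kept.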

Brian Agnew
A: 

You might also like to take a look at HTML Purifier, which is more advanced than Simple HTML DOM, if you find that Simple HTML DOM doesn't catch all the tags.

David Caunt