I need to clean up some VERY ugly HTML (think <span></span> <em></em> <em> </em> <strong></strong>) over and over again...

I'm looking for a nice and easy preg_replace to eliminate any html tags that contain optional whitespace between them. Your assistance is greatly appreciated!

Oh, and just found this beauty:

<p><strong><strong></strong></strong></p>

Looks like this will need to live in a while loop as well.

+5  A: 

It's funny how this topic keeps coming up.

Don't go with regex. Try HTML Tidy instead.

Peter Bailey
+2  A: 

If you are looking to really clean up some code, I'd suggest the Tidy class in PHP. There are some examples that might help get you started. (Note that this is a front end to HTML Tidy.)

jheddings
Tidy seconded. It's very good. And Eric, welcome to SO.
Pekka
A: 

If you really want a regex, here's one:

s:<(\w+)>\s*<\/\1>::g

Run it multiple times to eliminate nested cases.
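
For the PHP case in the question, a rough sketch of that loop with preg_replace might look like the following. The function name strip_empty_tags is made up for illustration, and the pattern is the same one as above, so it only matches tags without attributes (an empty <span class="x"></span> would be left alone).

```php
<?php
// Hypothetical helper (name is an assumption, not from the thread):
// repeatedly removes <tag></tag> pairs that contain only whitespace.
function strip_empty_tags(string $html): string
{
    // Opening tag without attributes, optional whitespace, matching closing tag.
    $pattern = '~<(\w+)>\s*</\1>~';

    do {
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            break; // preg_replace reported an error; keep the last good string
        }
        $html = $result;
        // Removing an inner pair can expose a newly empty outer pair,
        // so keep going until a pass removes nothing.
    } while ($count > 0);

    return $html;
}
```

With this, strip_empty_tags('<p><strong><strong></strong></strong></p>') returns an empty string after three passes, while '<p>keep</p>' comes back unchanged.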

Thom Smith
A: 

Well, it looks like tidy WAS the answer:

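function cleanupcrap($html) {
    $tidy_config = array(
        'clean'          => true,  // replace presentational markup with style rules
        'output-xhtml'   => true,  // emit XHTML
        'show-body-only' => true,  // return only the contents of <body>
        'wrap'           => 0,     // disable line wrapping
    );

    // Parse the fragment as UTF-8, repair it, and return the cleaned markup.
    $tidy = tidy_parse_string($html, $tidy_config, 'UTF8');
    $tidy->cleanRepair();
    return $tidy->value;
}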

Eric