Hello,

I have a string of HTML that looks like this:

<div>
    content
    <div>
         <div>
         content
              <div>

         </div>
    </div>

Notice that some closing </div> tags are missing, which risks breaking my theme when I use this content elsewhere.

What would be the best way to solve a problem like this? Below is what I have come up with on my own, but it is often not good enough. The function does not actually fix the markup; it boxes it in so that the broken HTML cannot break the HTML around it.

 // The function name is a placeholder; the original was unnamed.
 function balance_divs($string)
 {
     // Count opening and closing div tags.
     $div_open = substr_count($string, "<div");
     $div_close = substr_count($string, "</div>");

     if ($div_close < $div_open)
     {
         // Append the missing closing tags.
         $string .= str_repeat("</div>", $div_open - $div_close);
     }
     elseif ($div_close > $div_open)
     {
         // Prepend the missing opening tags.
         $string = str_repeat("<div>", $div_close - $div_open) . $string;
     }

     return $string;
 }

Is there a better way?

+5  A: 

A very solid way to clean your HTML output is to use PHP's Tidy extension.

You can do the following:

$text = '<div>content<div><div>content<div></div></div>';

$tidy = tidy_parse_string( $text );
$tidy->cleanRepair( );

echo $tidy;

and your HTML output will look like:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<div>content
<div>
<div>content</div>
</div>
</div>
</body>
</html>

Tidy also has quite a few settings you can play with, so it's basically up to you how your output will look.

A disadvantage would be that Tidy sometimes likes to do things that you really don't want to see. If your HTML code isn't really messed up badly, I recommend it.
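
As a rough sketch of what playing with the settings can look like, the snippet below asks Tidy for the repaired fragment only, without the doctype and the html/head/body wrapper. The function name tidy_fragment is made up for illustration, and the fallback of returning the input unchanged when the extension is missing is an assumption, not a recommendation. The option names (show-body-only, indent, wrap) are standard Tidy options.

```php
<?php
// Sketch only: tidy_fragment() is a hypothetical helper, not part of Tidy.
function tidy_fragment(string $html): string
{
    // Tidy is an optional extension; pass the input through untouched if it is missing.
    if (!extension_loaded('tidy')) {
        return $html;
    }

    $config = [
        'show-body-only' => true, // omit the doctype and the <html>/<head>/<body> wrapper
        'indent'         => true, // indent nested block elements
        'wrap'           => 0,    // do not wrap long lines
    ];

    $tidy = tidy_parse_string($html, $config, 'utf8');
    if ($tidy === false) {
        // Parsing failed; return the input as-is (assumed fallback).
        return $html;
    }

    $tidy->cleanRepair();

    return tidy_get_output($tidy);
}
```

Calling tidy_fragment() on the string from the question should give back just the nested divs, closed properly, so it can be dropped into an existing page.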

Ondrej Slinták
+1 Tidy is very tidy.
Stewie
Hey Thanks! Will this clean tables too?
atwellpub
Yes, it cleans anything that is HTML, depending on your Tidy settings.
Ondrej Slinták
Yes, but I've had to turn away from this solution because some users don't have the extension installed.
atwellpub
A: 

You could load your output into DOMDocument and try saving it with the formatOutput property enabled. That could work nicely!
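
Something along these lines might work as a starting point. It is only a sketch: the function name repair_fragment is invented for illustration, and it assumes the input is plain ASCII or otherwise needs no special encoding handling (loadHTML does not assume UTF-8 by default). The parser closes any open tags and ignores stray closing ones, then the body's children are serialized back to a string.

```php
<?php
// Sketch only: repair_fragment() is a hypothetical helper name.
function repair_fragment(string $html): string
{
    // loadHTML() rejects empty input, so handle that case up front.
    if (trim($html) === '') {
        return '';
    }

    $dom = new DOMDocument();

    // Broken markup triggers libxml warnings; collect them quietly instead.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser wraps the fragment in <html><body>; keep only the body's children.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }

    return $out;
}
```

One quirk to be aware of: bare text at the top level of the fragment may come back wrapped in a <p> element, so the result is not always byte-for-byte what you would have written by hand.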

danp
+1  A: 

Things like this are so variable, so unpredictable and so hard to nail down once broken that I would never attempt to fix them with my bare hands.

  1. Try and make sure it's not broken in the first place. Put user-submitted content through htmltidy so it's fixed (or at least smoothed over) as soon as data comes in.

  2. Throw it through something like BeautifulSoup. It's pretty magical when it comes to fixing slightly crufted up data and you can ask it to output it in a nice way too. htmltidy can do some of this but it's not as powerful IMO.

  3. Don't rely on one tag for everything. Nesting hundreds of divs will exacerbate this issue. Using HTML5 tags like <summary> and <article> (and others) will help limit the damage to just the dodgy area.

Oli
@Oli, Will surrounding content in, say, <article> prevent the erroneous coding from affecting the surrounding areas?
atwellpub
I found that enclosing content in these does not prevent formatting errors.
atwellpub