I tried several methods to find out which part of an HTML string is invalid:

$dom->loadHTML($badHtml);
$tidy->cleanRepair();
simplexml_load_string($badHtml);

None of them makes clear which part of the HTML is invalid. Maybe an extra config option for one of them can fix that. Any ideas?

I need this to manually fix HTML input from users. I don't want to rely on automated processes.

+1  A: 

I'd try loading the offending HTML into a DOMDocument (as you are already doing) and then using SimpleXML to output the fixed markup. You should be able to run a quick diff against the original to see where the errors are.

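// Suppress the warnings loadHTML() emits for malformed markup.
error_reporting(0);

$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';

$doc = new DOMDocument();
$doc->encoding = 'UTF-8';

// The parser repairs the markup while it builds the tree.
$doc->loadHTML($badHTML);

// Serialize the repaired tree so it can be diffed against $badHTML.
$goodHTML = simplexml_import_dom($doc)->asXML();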
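
As a variation on the snippet above (a different technique, not a diff), the parser's own complaints can be collected instead of suppressed. This is only a minimal sketch: it assumes the libxml-backed DOM extension is available, and the exact messages, line numbers, and even the number of errors depend on the installed libxml version.

```php
<?php
// Sketch: collect libxml parse errors instead of hiding them.
$badHTML = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';

// Buffer libxml errors internally rather than emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($badHTML);

// Fetch the buffered LibXMLError objects, then restore the previous setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each LibXMLError carries the position and a human-readable message.
    printf(
        "line %d, column %d: %s\n",
        $error->line,
        $error->column,
        trim($error->message)
    );
}
```

The reported positions refer to the input string, so they can point a human editor at the offending spot without any automatic repair being applied.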
Nev Stokes
+1  A: 

You can compare the cleaned and bad versions with PHP Inline-Diff, found in an answer to that Stack Overflow question.
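
If pulling in a diff library is not an option, the comparison can be roughed out by hand. The sketch below is not PHP Inline-Diff; it is a small longest-common-subsequence diff over tag and text tokens. The helper names `tokenizeHtml` and `diffTokens` are made up for illustration, and `$good` is an assumed cleaned string rather than real parser output.

```php
<?php
// Hypothetical helper: split markup into tag tokens and text tokens.
function tokenizeHtml(string $html): array
{
    return preg_split(
        '/(<[^>]*>)/',
        $html,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

// Hypothetical helper: LCS-based diff of two token lists.
// Returns [marker, token] pairs; marker is ' ' (same), '-' (removed) or '+' (added).
function diffTokens(array $old, array $new): array
{
    $n = count($old);
    $m = count($new);
    // $lcs[$i][$j] = LCS length of $old[$i..] and $new[$j..].
    $lcs = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    for ($i = $n - 1; $i >= 0; $i--) {
        for ($j = $m - 1; $j >= 0; $j--) {
            $lcs[$i][$j] = ($old[$i] === $new[$j])
                ? $lcs[$i + 1][$j + 1] + 1
                : max($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
        }
    }
    // Walk the table from the start, emitting the edit script.
    $out = [];
    $i = 0;
    $j = 0;
    while ($i < $n && $j < $m) {
        if ($old[$i] === $new[$j]) {
            $out[] = [' ', $old[$i]];
            $i++;
            $j++;
        } elseif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            $out[] = ['-', $old[$i++]];
        } else {
            $out[] = ['+', $new[$j++]];
        }
    }
    while ($i < $n) {
        $out[] = ['-', $old[$i++]];
    }
    while ($j < $m) {
        $out[] = ['+', $new[$j++]];
    }
    return $out;
}

$bad  = '<p>Some <em><strong>badly</em> nested</stong> tags</p>';
// Assumed cleaned version, written by hand for illustration only.
$good = '<p>Some <em><strong>badly</strong></em> nested tags</p>';

foreach (diffTokens(tokenizeHtml($bad), tokenizeHtml($good)) as [$marker, $token]) {
    echo $marker, ' ', $token, "\n";
}
```

Because tags are kept as whole tokens, they survive in the output. When printing the result into a web page rather than a terminal, passing each token through `htmlspecialchars` would show the tags literally instead of letting the browser render them.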

jcubic
Is there an option to preserve HTML tags and show the difference between strings that contain HTML?
Maybe try using the `strip_tags` function.
jcubic