Is there a better approach to parsing invalid HTML than running Tidy on it?

Side note: there are situations where Tidy isn't available. As I understand it, regular expressions are also not recommended for parsing HTML.

+3  A: 

I would try something like this: http://php.net/manual/en/domdocument.loadhtml.php

From that page:

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load. This function may also be called statically to load and create a DOMDocument object.
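As a rough sketch of how that could look in practice (the sample markup and variable names here are made up for illustration), the snippet below feeds deliberately broken HTML to `DOMDocument::loadHTML()` through an instance call and uses `libxml_use_internal_errors()` so the parser's complaints are collected rather than raised as PHP warnings:

```php
<?php
// Deliberately malformed markup (illustrative only): unclosed <p> and <b>,
// and no <html>/<body> wrapper.
$html = '<div><p>First paragraph<p>Second <b>bold text</div>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Keep the collected parse errors for inspection, then restore the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Walk the tree that the parser built from the broken input.
$paragraphs = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

print_r($paragraphs);
```

How the parser repairs a given piece of markup depends on libxml, so it is worth checking the resulting tree (and `$errors`) against your real input before relying on it.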

Rob
A: 

SimpleHTMLDOM is known to be more lenient than PHP's native DOM functions.

Pekka
Suggested third-party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of string parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org).
Gordon
@Gordon this time you were too quick :) He is looking to parse broken HTML.
Pekka
@Pekka which all DOM-based parsers should be able to handle fine when they use [libxml's HTML parser module](http://xmlsoft.org/html/libxml-HTMLparser.html).
Gordon
@Gordon nice, wasn't aware of the difference!
Pekka
A: 

This question is a duplicate of http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php

demonkoryu