Are there any robust and mature HTML parsers available for PHP? A quick skim of PEAR didn't turn anything up (lots of classes for generating HTML, not so many for consuming it), and Google taught me that a lot of people have started and then abandoned a variety of parser projects.

Not interested in XML parsers (unless they can consume non-well-formed HTML) or hacking it on my own with regular expressions.

Clarification of Intent: I'm not interested in filtering HTML content, I'm interested in extracting information from HTML documents.

A: 

I've used HTML Purifier with a lot of success on a couple different projects.

mabwi
A quick glance makes that look more like a filtering library than a parser. Have you used the parser classes to actually extract information from documents?
Alan Storm
Ah, I answered before the edit. It was originally asking for a filtering library.
mabwi
A: 

XML_HTMLSax is rather stable, even if it's not maintained any more. Another option could be to pipe your HTML through HTML Tidy and then parse it with standard XML tools.

troelskn
+6  A: 

Simple HTML Dom is a great open-source parser:

http://simplehtmldom.sourceforge.net/

It treats DOM elements in an object-oriented way, and the new iteration has a lot of coverage for non-compliant code. There are also some great functions like you'd see in JavaScript, such as the "find" function, which will return all instances of elements of that tag name.

I've used this in a number of tools, testing it on many different types of web pages, and I think it works great.

Robert Elwell
+1  A: 

You could try using something like HTML Tidy to clean up any "broken" HTML and convert the HTML to XHTML, which you can then parse with an XML parser.
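
A minimal sketch of that Tidy-then-XML route, assuming PHP's optional tidy extension is installed. The helper name `tidyToXhtml` and the sample markup are made up for illustration; `output-xhtml`, `numeric-entities` and `wrap` are Tidy configuration options. The helper returns null when the extension is missing rather than failing.

```php
<?php
// Hypothetical helper: returns repaired XHTML, or null when ext/tidy is not installed.
function tidyToXhtml(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null;
    }
    $tidy = new tidy();
    $tidy->parseString($html, [
        'output-xhtml'     => true, // emit well-formed XHTML
        'numeric-entities' => true, // avoid named entities an XML parser would not know
        'wrap'             => 0,    // do not re-wrap long lines
    ], 'utf8');
    $tidy->cleanRepair();
    return tidy_get_output($tidy);
}

// Broken input: unclosed <b> and <p> tags.
$xhtml = tidyToXhtml('<p>Hello <b>world<p>Second paragraph');
if ($xhtml !== null) {
    // The repaired output can now go through a strict XML parser.
    $doc = new DOMDocument();
    $doc->loadXML($xhtml);
    foreach ($doc->getElementsByTagName('p') as $p) {
        echo trim($p->textContent), "\n";
    }
}
```

The design trade-off is an extra dependency and an extra pass over the markup in exchange for being able to use any strict XML tool afterwards.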

CesarB
+5  A: 

PHP Simple DOM Parser looks good. I haven't tried using it yet though.

Josh
I'm gonna start using it tomorrow, and it looks very nice, thx :) It supports XPath, and that's just nice for my needs. I also tried http://querypath.org/ but it fails on invalid HTML (it uses DOMDocument to load the HTML...)
Quamis
I've used this many times, very effective and easy to use. Will buy from again A++++, heh.
Patrick
The ability to load invalid/broken HTML is the big argument for SimpleHTMLDOM.
Pekka
Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org).
Gordon
+20  A: 

Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast, and contrary to popular belief, does not choke on malformed HTML.
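
A minimal sketch of what that looks like on deliberately broken markup. The sample HTML is invented for illustration, and passing `LIBXML_NOERROR | LIBXML_NOWARNING` is just one way to keep the parser quiet about the bad input.

```php
<?php
// Deliberately malformed: unclosed <p> and <b>, no <html> or <body> wrapper.
$html = '<p>First paragraph<p>Second <b>paragraph';

$doc = new DOMDocument();
// The option flags silence libxml's complaints about the broken markup.
$doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

// Collect the text of every <p> the parser recovered.
$texts = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $texts[] = trim($p->textContent);
}

print_r($texts);
```

The parser closes the first paragraph when it meets the second `<p>`, so both paragraphs come back as separate elements.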

Edward Z. Yang
True. And it works with PHP's built-in XPath and XSLTProcessor classes, which are great for extracting content.
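
For example, a small sketch of pulling links out with DOMXPath. The markup and the XPath expression are illustrative assumptions, not taken from any particular page.

```php
<?php
$html = '<ul><li><a href="/a">Alpha</a></li><li><a href="/b">Beta</a></li><li>No link</li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

// DOMXPath runs XPath 1.0 queries against the parsed document.
$xpath = new DOMXPath($doc);

// Map each href to its link text for every <a> that sits inside an <li>.
$links = [];
foreach ($xpath->query('//li/a[@href]') as $a) {
    $links[$a->getAttribute('href')] = $a->textContent;
}

print_r($links);
```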
porneL
For really mangled HTML, you can always run it through htmltidy before handing it off to DOM. Whenever I need to scrape data from HTML, I always use DOM, or at least simplexml.
Frank Farmer
I've been re-researching this, and discovered that the problem I was having with DomDocument's loadXML method was due to an older linked version of libxml. I've been working on more up-to-date systems and DomDocument::loadHTML works like a charm.
Alan Storm
Another thing with loading malformed HTML is that it might be wise to call libxml_use_internal_errors(true) first, to suppress the PHP warnings the parser would otherwise raise for every problem it finds.
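
A short sketch of that pattern, assuming you want to inspect the parser's complaints afterwards rather than discard them. The sample markup is invented, and which messages libxml reports (if any) depends on the libxml version.

```php
<?php
$html = '<div><p>Unclosed paragraph<foo>unknown tag</div>';

// Buffer libxml errors internally instead of raising PHP warnings.
// The call returns the previous setting so it can be restored later.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Fetch whatever the parser complained about, then empty the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError with message, line and column properties.
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

echo $doc->getElementsByTagName('p')->length, " paragraph(s) parsed\n";
```

Restoring the previous setting keeps the change from leaking into other code that relies on the default behaviour.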
Husky
I have used DOMDocument to parse about 1000 HTML sources (in various languages, encoded with different charsets) without any issues. You might run into encoding issues with this, but they aren't insurmountable. You need to know 3 things: 1) loadHTML uses the meta tag's charset to determine the encoding 2) #1 can lead to incorrect encoding detection if the HTML content doesn't include this information 3) bad UTF-8 characters can trip the parser. In such cases, use a combination of mb_detect_encoding() and Simplepie RSS Parser's code for encoding, converting and stripping bad UTF-8 characters as workarounds.
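
One possible workaround along those lines, as a sketch only: detect the encoding with mb_detect_encoding(), convert to UTF-8, and prepend a meta tag so loadHTML does not have to guess. It assumes the mbstring extension is available; the helper name `loadHtmlAsUtf8` and the candidate encoding list are assumptions for illustration, and it does not cover stripping invalid UTF-8 sequences.

```php
<?php
// Hypothetical helper: normalise input to UTF-8 and tell the parser so explicitly.
function loadHtmlAsUtf8(string $html, array $candidates = ['UTF-8', 'ISO-8859-1']): DOMDocument
{
    // Strict detection: the first candidate for which the bytes are valid wins.
    $encoding = mb_detect_encoding($html, $candidates, true) ?: 'UTF-8';
    $utf8 = mb_convert_encoding($html, 'UTF-8', $encoding);

    // loadHTML reads the charset from a meta tag, so supply one up front.
    $hint = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    $doc = new DOMDocument();
    $doc->loadHTML($hint . $utf8, LIBXML_NOERROR | LIBXML_NOWARNING);
    return $doc;
}

// "café" encoded as ISO-8859-1, with no charset information in the markup.
$latin1 = "<p>caf\xE9</p>";
$doc = loadHtmlAsUtf8($latin1);
$text = $doc->getElementsByTagName('p')->item(0)->textContent;

echo $text, "\n";
```

Detection from a short candidate list is a heuristic; if the real encoding is known from an HTTP header, passing that instead is more reliable.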
Vasu
+1  A: 

html5lib has a PHP version. (I don't know how up-to-date it is.)

Ms2ger
+1  A: 

Here is one more parser: http://code.google.com/p/wiseparser/. It requires PHP5 and works in a manner close to real browsers.

Marat
This is a reimplementation of the Perl HTML::Treebuilder class in PHP. It should be helpful if you want that class's behaviour.
ftrotter
A: 

Do any of these HTML parsers keep track of text nodes? Simple HTML Dom is working great for me, except I can't get at text nodes themselves from the DOM.

Ted
+5  A: 

A little late to the party, but may I suggest phpQuery?

http://code.google.com/p/phpquery/

Dan Hulton
That looks pretty interesting, and doing an end run around DOM style parsing makes me happy to boot!
Alan Storm