Suggestion for a reference question. Stack Overflow has dozens of "How to parse HTML" questions coming in every day. However, it is very difficult to close as a duplicate because most questions deal with the specific scenario presented by the asker. This question is an attempt to build a generic "reference question" that covers all aspects of the issue.

This is an experiment. If such a reference question already exists, let me know and I'll happily remove this one.

My ideal vision is that each of the three questions gets answered separately, and the best answers to each bubble up to the top.

I will be awarding a 200 bounty to the best answer in each of the three categories two weeks from now, pending discussion of this question on Meta.

Each of these questions has already been answered brilliantly elsewhere, so copy+pasting your own answer to a different question is fine with me.

How do I parse HTML with PHP?

  1. What libraries are there? Which ones use PHP's native DOM, which ones come with their own parsing engine? (Hint: SimpleHTMLDOM)

    1a. I need to find a specific element, but I find it hard to get used to the XPath syntax. Are there any DOM-based libraries that make parsing HTML easier? Please consider making generic real world examples.

  2. Is there a PHP library that enables me to query the DOM using CSS[2/3] selectors, like jQuery does? (Hint: phpQuery) Please consider making generic real world examples.

  3. Bonus question: Why shouldn't I use regular expressions? Please provide a very short answer in layman's terms.

+2  A: 

For 1a and 2: I would vote for the new Symfony component DomCrawler ( http://github.com/symfony/symfony/tree/master/src/Symfony/Component/DomCrawler/ ). This class allows queries similar to CSS selectors. Take a look at this presentation for real-world examples: http://www.slideshare.net/fabpot/news-of-the-symfony2-world.

The component is designed to work standalone and can be used without Symfony.

The only drawback is that it will only work with PHP 5.3 or newer.

Timo
I wish people would stop calling them *jQuery-like* CSS Queries. [CSS Selectors are a W3C recommendation](http://www.w3.org/TR/selectors-api/) and can very much be done without jQuery.
Gordon
@Gordon: I will try to remember it the next time :-)
Timo
@Gordon - I agree, but should they even be referred to as _CSS_ selectors? The recommendation merely refers to them as "Selectors" and the closest it gets to "CSS Selectors" is "Selectors, which are widely used in CSS."
LeguRi
@Richard I don't care if you call them [CSS Selectors](http://www.w3.org/TR/css3-selectors/) or just Selectors, as long as you don't call them jQuery Selectors ;)
Gordon
+8  A: 

Why you shouldn't and when you should use regular expressions?

First off, HTML cannot be properly parsed using regular expressions. Regexes can, however, extract data; extracting is what they're made for. The major drawbacks of regex-based HTML extraction compared with proper SGML toolkits or basic XML parsers are its cumbersome syntax and poor reliability.

Consider that a somewhat reliable HTML extraction regex like

<a\s+class="?playbutton\d?[^>]+id="(\d+)".+?    <a\s+class="[\w\s]*title
[\w\s]*"[^>]+href="(http://[^">]+)"[^>]*>([^<>]+)</a>.+?

is way less readable than a simple phpQuery or QueryPath equivalent:

$div->find(".stationcool a")->attr("title");

There are, however, specific use cases where they can help. Most XML parsers cannot see HTML document comments (<!--), which are sometimes more useful anchors for extraction purposes. Occasionally regular expressions can save post-processing. And lastly, for extremely simple tasks like extracting <img src= URLs, they are in fact a viable tool. The speed advantage over SGML/XML parsers mostly only comes into play for these very basic extraction procedures.

It's sometimes even advisable to pre-extract a snippet of HTML using regular expressions /<!--CONTENT-->(.+?)<!--END-->/ and process the remainder using the simpler HTML parser methods.
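
The two-step approach described above could be sketched roughly as follows. This is only an illustration, not code from the app mentioned in the note: the page content, the marker comments, and the link are made up, and PHP's native DOMDocument stands in for "the simpler HTML parser methods". Note the /s modifier, which is needed so that . also matches line breaks.

```php
<?php
// Hypothetical page; the <!--CONTENT--> / <!--END--> markers are assumed to exist.
$html = <<<'EOT'
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<!--CONTENT-->
<p>See <a href="http://example.com/a?x=1&amp;y=2">the article</a>.</p>
<!--END-->
<div id="footer">footer</div>
</body></html>
EOT;

$links = [];

// Step 1: the regex only cuts out the region between the two marker comments.
if (preg_match('/<!--CONTENT-->(.+?)<!--END-->/s', $html, $m)) {
    // Step 2: a real HTML parser handles the snippet itself.
    libxml_use_internal_errors(true); // keep markup warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($m[1]);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $a) {
        // getAttribute() returns the value with entities already decoded
        $links[] = $a->getAttribute('href');
    }
}

print_r($links);
```

The navigation link outside the markers is never seen by the parser, and the &amp; in the href comes back as a plain &.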

Note: I actually have this app where I employ XML parsing and regular expressions as alternatives. Just last week the PyQuery parsing broke, and the regex still worked. Yes, weird, and I can't explain it myself. But so it happened.
So please don't vote real-world considerations down just because they don't match the regex=evil meme. But let's also not vote this up too much. It's just a sidenote for this topic.

mario
[`DOMComment`](http://de.php.net/manual/en/class.domcomment.php) can read comments, so no reason to use Regex for that.
Gordon
Neither SGML toolkits nor XML parsers are suitable for parsing real world HTML. For that, only a dedicated HTML parser is appropriate.
Alohci
@Alohci [`DOM`](http://de.php.net/manual/en/book.dom.php) uses [libxml](http://xmlsoft.org/) and [libxml has a separate HTML parser](http://xmlsoft.org/html/libxml-HTMLparser.html) module which will be used when loading HTML with [`loadHTML()`](http://de.php.net/manual/en/domdocument.loadhtml.php) so it can very much load "real-world" (read broken) HTML.
Gordon
@Gordon - thanks. HTML parsers and XML parsers are still different things though, even if they're packaged in the same library. And they're both different from DOM implementations.
Alohci
Well, just a comment about your "real-world consideration" standpoint. Sure, there ARE useful situations for Regex when parsing HTML. And there are also useful situations for using GOTO. And there are useful situations for variable-variables. So no particular implementation is definitively code-rot for using it. But it is a VERY strong warning sign. And the average developer isn't likely to be nuanced enough to tell the difference. So as a general rule, Regex, GOTO and Variable-Variables are all evil. There are non-evil uses, but those are the exceptions (and rare at that)... (IMHO)
ircmaxell
+1  A: 

1. Third-party alternatives to SimpleHtmlDom that use DOM instead of string parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.

If you already copy my comments, at least link them properly ;) That should be: Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org).
Gordon
Good answers are a great source. http://stackoverflow.com/questions/3606792/best-way-to-parse-an-invalid-html-in-php
A: 

phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.

Examples for QueryPath

Basically you first create a queryable DOM tree from an HTML string:

 $qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL

The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:

 $qp->find("div.classname")->children()->...;

 foreach ($qp->find("p img") as $img) {
     print qp($img)->attr("src");
 }

Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which sometimes are faster. Also, typical jQuery methods like ->children() and ->text() and particularly ->attr() simplify extracting the right HTML snippets. (And they already have their SGML entities decoded.)

 $qp->xpath("//div/p[1]");  // get first paragraph in a div
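
For comparison, the same kind of query can be run without any third-party library. The following is a minimal sketch using PHP's native DOM extension (DOMDocument and DOMXPath) rather than QueryPath; the markup and variable names are invented for illustration. It uses the XPath expression from above plus //p//img, the XPath counterpart of the CSS selector "p img".

```php
<?php
// Hypothetical markup for illustration.
$html = '<div><p>first</p><p><img src="a.png"></p></div>'
      . '<div><p>third</p><img src="b.png"></div>';

libxml_use_internal_errors(true); // keep markup warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// First <p> child of every <div>
$firstParagraphs = [];
foreach ($xpath->query('//div/p[1]') as $p) {
    $firstParagraphs[] = $p->textContent;
}

// Every <img> nested anywhere inside a <p> (CSS: "p img")
$sources = [];
foreach ($xpath->query('//p//img') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($firstParagraphs);
print_r($sources);
```

The second image is not matched because it sits directly in a div, not inside a paragraph.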

QueryPath also allows injecting new tags into the stream (->append), and later outputting and prettifying an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).

 $qp->find("a[target=_blank]")->toggleClass("usability-blunder");
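
To give an idea of what such a manipulation involves underneath, here is a rough native-DOM sketch (DOMDocument and DOMXPath, not QueryPath). The markup is made up, and the loop only ever adds the class, so it is a simplification of toggleClass rather than an equivalent.

```php
<?php
// Hypothetical markup for illustration.
$html = '<p><a href="/in">in</a> <a href="/out" target="_blank">out</a></p>';

libxml_use_internal_errors(true); // keep markup warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// XPath counterpart of the CSS selector a[target=_blank]
$flagged = 0;
foreach ($xpath->query('//a[@target="_blank"]') as $a) {
    // Appends the class; unlike toggleClass() it never removes one.
    $existing = trim($a->getAttribute('class'));
    $a->setAttribute('class', trim($existing . ' usability-blunder'));
    $flagged++;
}

// Inject a new element, then serialize the updated document.
$note = $doc->createElement('em', 'checked');
$doc->getElementsByTagName('p')->item(0)->appendChild($note);

$output = $doc->saveHTML();
echo $output;
```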

phpQuery or QueryPath?

Generally QueryPath is better suited for manipulation of documents, while phpQuery also implements some pseudo-AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because it has fewer features overall).
For further information on the differences, see this comparison: http://www.tagbytag.org/articles/phpquery-vs-querypath

And here's a comprehensive QueryPath introduction: http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html?S_TACT=105AGX01&S_CMP=HP

Advantages

  • Simplicity and Reliability
  • Simple to use alternatives ->find("a img, a object, div a")
  • Proper data unescaping (in comparison to regular expression grepping)
mario