ansaurus

Question

Answer 1

+2 A:

For 1a and 2: I would vote for the new Symfony Component class DomCrawler ( http://github.com/symfony/symfony/tree/master/src/Symfony/Component/DomCrawler/ ). This class allows queries similar to CSS Selectors. Take a look at this presentation for real-world examples: http://www.slideshare.net/fabpot/news-of-the-symfony2-world.

The component is designed to work standalone and can be used without Symfony.

The only drawback is that it will only work with PHP 5.3 or newer.

Timo 2010-09-06 09:19:20

I wish people would stop calling them *jQuery-like* CSS Queries. [CSS Selectors are a W3C recommendation](http://www.w3.org/TR/selectors-api/) and can very much be done without jQuery.

Gordon 2010-09-06 09:52:30

@Gordon: I will try to remember it the next time :-)

Timo 2010-09-06 10:07:26

@Gordon - I agree, but should they even be referred to as _CSS_ selectors? The recommendation merely refers to them as "Selectors" and the closest it gets to "CSS Selectors" is "Selectors, which are widely used in CSS."

LeguRi 2010-09-07 15:10:01

@Richard I don't care if you call them [CSS Selectors](http://www.w3.org/TR/css3-selectors/) or just Selectors, as long as you don't call them jQuery Selectors ;)

Gordon 2010-09-07 19:01:40

Answer 2

+8 A:

Why you shouldn't and when you should use regular expressions?

First off, HTML cannot be properly parsed using regular expressions. Regexes can, however, extract data; extracting is what they're made for. The major drawbacks of regex HTML extraction compared with proper SGML toolkits or basic XML parsers are its cumbersome syntax and meager reliability.

Consider that making a somewhat reliable HTML extraction regex:

<a\s+class="?playbutton\d?[^>]+id="(\d+)".+?    <a\s+class="[\w\s]*title
[\w\s]*"[^>]+href="(http://[^">]+)"[^>]*>([^<>]+)</a>.+?

is way less readable than a simple phpQuery or QueryPath equivalent:

$div->find(".stationcool a")->attr("title");

There are, however, specific use cases where they can help. Most XML parsers cannot see HTML document comments (<!--), which are sometimes more useful anchors for extraction. Occasionally regular expressions can save post-processing. And lastly, for extremely simple tasks like extracting <img src= URLs, they are in fact a suitable tool. The speed advantage over SGML/XML parsers mostly comes into play only for these very basic extraction procedures.
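As a minimal sketch of the "extremely simple task" case, the snippet below pulls <img src= URLs out of a string with preg_match_all. The helper name extract_img_srcs and the sample markup are illustrative assumptions, not part of the original answer, and the pattern is intentionally naive.

```php
<?php
// Hypothetical helper (not from the answer): collect the src value of every <img> tag.
function extract_img_srcs(string $html): array
{
    // Matches src="..." or src='...' inside an <img ...> tag.
    // Limitations: unquoted values are skipped, and an attribute such as
    // data-src would also match because \b sees a word boundary after "-".
    $pattern = '/<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);

    // Group 2 holds the raw attribute values; entities like &amp; are not decoded.
    return $matches[2];
}

$html = '<p><img src="a.png" alt="A"><IMG class="x" src=\'b.jpg\'></p>';
print_r(extract_img_srcs($html));
```

Anything beyond this (nested structures, attributes on related tags, entity decoding) is where the readability and reliability problems described above start to show.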

It's sometimes even advisable to pre-extract a snippet of HTML using regular expressions /(.+?)/ and process the remainder using the simpler HTML parser methods.

Note: I actually have an app where I employ XML parsing and regular expressions alternately. Just last week the PyQuery parsing broke, and the regex still worked. Yes, weird, and I can't explain it myself. But so it happened.
So please don't vote real-world considerations down just because they don't match the regex=evil meme. But let's also not vote this up too much. It's just a side note for this topic.

mario 2010-09-06 09:40:53

[`DOMComment`](http://de.php.net/manual/en/class.domcomment.php) can read comments, so no reason to use Regex for that.
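A rough sketch of what this comment describes, using only PHP's bundled DOM extension: comment nodes can be selected with the XPath node test `comment()`. The helper name `extract_comments` and the sample markup are assumptions made for illustration.

```php
<?php
// Hypothetical helper: return the trimmed text of every HTML comment in a document.
function extract_comments(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $comments = [];
    // comment() selects comment nodes; each result is a DOMComment instance.
    foreach ($xpath->query('//comment()') as $node) {
        $comments[] = trim($node->nodeValue);
    }
    return $comments;
}

$html = '<html><body><!-- begin content --><p>Hi</p><!-- end content --></body></html>';
print_r(extract_comments($html));
```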

Gordon 2010-09-06 09:48:12

Neither SGML toolkits nor XML parsers are suitable for parsing real world HTML. For that, only a dedicated HTML parser is appropriate.

Alohci 2010-09-06 09:53:56

@Alohci [`DOM`](http://de.php.net/manual/en/book.dom.php) uses [libxml](http://xmlsoft.org/) and [libxml has a separate HTML parser](http://xmlsoft.org/html/libxml-HTMLparser.html) module which will be used when loading HTML with [`loadHTML()`](http://de.php.net/manual/en/domdocument.loadhtml.php) so it can very much load "real-world" (read broken) HTML.
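To illustrate the point about `loadHTML()`, here is a small sketch that feeds deliberately malformed markup to `DOMDocument`. The input string is an invented example; the exact shape of the repaired tree depends on the libxml version, so treat the traversal as a demonstration rather than a guarantee.

```php
<?php
// Deliberately malformed input: unclosed <p> and <b>, no <html> or <body>.
$broken = '<div id="main"><p>First<p>Second <b>bold</div>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);  // keep parse warnings out of the output
$doc->loadHTML($broken);
$warnings = libxml_get_errors();               // inspect these if diagnostics are needed
libxml_clear_errors();
libxml_use_internal_errors($previous);         // restore the earlier error-handling mode

// The HTML parser has repaired the tree, so ordinary DOM traversal works.
$paragraphs = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $paragraphs[] = trim($p->textContent);
}
print_r($paragraphs);
```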

Gordon 2010-09-06 09:57:29

@Gordon - thanks. HTML parsers and XML parsers are still different things though, even if they're packaged in the same library. And they're both different from DOM implementations.

Alohci 2010-09-06 10:01:49

Well, just a comment about your "real-world consideration" standpoint. Sure, there ARE useful situations for Regex when parsing HTML. And there are also useful situations for using GOTO. And there are useful situations for variable-variables. So no particular implementation is definitively code-rot for using it. But it is a VERY strong warning sign. And the average developer isn't likely to be nuanced enough to tell the difference. So as a general rule, Regex GOTO and Variable-Variables are all evil. There are non-evil uses, but those are the exceptions (and rare at that)... (IMHO)

ircmaxell 2010-09-07 12:11:26

Answer 3

+1 A:

1. Third-party alternatives to SimpleHtmlDom that use DOM instead of String Parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.

2010-09-07 08:57:59

If you already copy my comments, at least link them properly ;) That should be: Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org).

Gordon 2010-09-07 18:49:13

Good answers are a great source. http://stackoverflow.com/questions/3606792/best-way-to-parse-an-invalid-html-in-php

2010-09-08 12:46:07

Answer 4

A:

phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're among the easiest approaches to properly parse HTML in PHP.

Examples for QueryPath

Basically you first create a queryable DOM tree from a HTML string:

 $qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL

The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:

 $qp->find("div.classname")->children()->...;

 foreach ($qp->find("p img") as $img) {
     print qp($img)->attr("src");
 }

Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which are sometimes faster. Typical jQuery methods like ->children() and ->text() and particularly ->attr() also simplify extracting the right HTML snippets. (And they already have their SGML entities decoded.)

 $qp->xpath("//div/p[1]");  // get first paragraph in a div
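For readers without QueryPath installed, the same XPath expression can be tried with PHP's bundled DOMXPath. This is a sketch with invented sample markup, and it uses the DOM extension rather than QueryPath itself. Note that //div/p[1] selects the first paragraph of every div, not only the first one in the document.

```php
<?php
$html = '<div><p>one</p><p>two</p></div><div><p>three</p></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings about the missing <html>/<body>
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$first = [];
// p[1] is evaluated relative to each parent <div>, so there is one hit per div.
foreach ($xpath->query('//div/p[1]') as $p) {
    $first[] = $p->textContent;
}
print_r($first);
```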

QueryPath also allows injecting new tags into the stream (->append), and later outputting and prettifying the updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).

 $qp->find("a[target=_blank]")->toggleClass("usability-blunder");

phpQuery or QueryPath?

Generally QueryPath is better suited for manipulation of documents, while phpQuery also implements some pseudo AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because it has fewer features overall).
For more information on the differences see this comparison: http://www.tagbytag.org/articles/phpquery-vs-querypath

And here's a comprehensive QueryPath introduction: http://www.ibm.com/developerworks/opensource/library/os-php-querypath/index.html?S_TACT=105AGX01&S_CMP=HP

Advantages

Simplicity and Reliability
Simple-to-use alternatives ->find("a img, a object, div a")
Proper data unescaping (in comparison to regular expression grepping)
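
The unescaping point can be seen in a short comparison. This sketch uses PHP's bundled DOM extension in place of QueryPath, and the sample link is invented: the regex returns the attribute exactly as written in the source, while the DOM hands back the decoded value.

```php
<?php
$html = '<a href="/search?q=php&amp;page=2" title="Tom &amp; Jerry">link</a>';

// Regex: the captured value is the raw source text, entities included.
preg_match('/href="([^"]*)"/', $html, $m);
$rawHref = $m[1];

// DOM: attribute values come back with entities already decoded.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$link = $doc->getElementsByTagName('a')->item(0);
$domHref = $link->getAttribute('href');

echo $rawHref, "\n";                      // still contains &amp;
echo $domHref, "\n";                      // contains a plain &
echo html_entity_decode($rawHref), "\n";  // extra step needed for the regex result
```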

mario 2010-09-07 14:45:45

How to parse HTML with PHP?
