At the moment I have a nice class that generates HTML and allows me to create pages without having to worry about things like closing tags, proper nesting, or clear formatting. The syntax is simple and straightforward:

//Create an anchor tag
$anchor = new Tag("a", array("name"=>"anchor"));
//Create a paragraph
$paragraph = Tag::Craft("p", "Lorem ipsum dolor sit amet, consectetur.");
//Create a container for them and add them
$div = new Tag("div", "id='container'");
$div->add($anchor);
$div->add($paragraph);
echo $div;

Creates:

<div id="container">
  <a name="anchor" />
  <p>Lorem ipsum dolor sit amet, consectetur.</p>
</div>

This is all well and good: I can quickly create tags, fill them with content and other tags, and output them cleanly. However, I cannot take existing HTML and parse it in, or find a Tag using something like XPath.

As far as I can tell I have 2 options:

  • Write XPath and parsing functionality into my Tag tool. Time-consuming, annoying, and probably effort better spent elsewhere.
  • Use DOM objects. Very sparse documentation and not fully baked in places, especially since the production environment's PHP is a few minor versions behind. Also, this will be used for HTML, not XML, which could cause a lot of errors and log spam.

Any thoughts on where I should go from here? Or experience using DOM to achieve this?

+4  A: 

I'd say bite the bullet and go for the built-in DOM. To give you a few pointers about your concerns:

  • 'Sparse documentation': the PHP manual is perhaps not that verbose (enough in my opinion, but indeed less than for some older functionality), but with very few exceptions PHP's DOM follows the DOM standard, so documentation for any DOM implementation should largely apply.
  • 'Not fully baked': care to clarify exactly what you mean?
  • DOM has been around in PHP for a while; if you use PHP 5.0 or 5.1, you can probably use it.
  • The error level of DOM is adjustable with the DOMDocument->strictErrorChecking property and with libxml_use_internal_errors(), which you can use to suppress errors or decide for yourself what to do with them.
  • You already have an implementation, and with DOMDocument::registerNodeClass() you can try to keep most of that functionality by extending DOMElement with the functions and attributes you miss, possibly even auto-importing standalone DOMElements into the last-used DOMDocument by extending the constructor.
  • The implementation is in quite optimized C, and will probably be both faster and less buggy than your own implementation (for the time being; maybe you are a great programmer :) ).
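To make the registerNodeClass() point concrete, here is a minimal sketch of extending DOMElement. The class name TagElement and its addClass() helper are made up for illustration and are not part of PHP's DOM extension; only registerNodeClass(), createElement(), getAttribute(), setAttribute() and saveHTML() are real DOM calls. Passing a node to saveHTML() assumes PHP 5.3.6 or later.

```php
<?php
// Hypothetical subclass: TagElement and addClass() are illustrative names,
// standing in for whatever helpers the existing Tag class offers.
class TagElement extends DOMElement
{
    // Append a CSS class name to the element's class attribute.
    public function addClass($name)
    {
        $current = trim($this->getAttribute('class'));
        $this->setAttribute('class', $current === '' ? $name : $current . ' ' . $name);
    }
}

$dom = new DOMDocument();
// Ask the document to create TagElement objects wherever it would
// otherwise create plain DOMElement objects.
$dom->registerNodeClass('DOMElement', 'TagElement');

$div = $dom->createElement('div');
$dom->appendChild($div);

// The custom helper is now available on elements created by this document.
$div->addClass('container');
$div->addClass('wide');

echo $dom->saveHTML($div), "\n";
```

Elements produced by loading existing markup into the same document should come back as the registered class as well, so helpers like this would be available on parsed content too.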

All in all, it depends on the time involved in rewriting it to DOMDocument (which you can ease by extending internal classes) versus rolling out your own extensions / additions to your library. If your needs are small and quickly met by rolling your own, by all means write your own. If you're going the route of writing your own XPath implementation (which sounds like fun :) ), be sure to implement the whole XPath 1.0 or 2.0 specification: nothing is more frustrating for future developers than an incomplete implementation of the specs when they don't expect it.
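To give a rough idea of what the rewrite might look like, here is a sketch that builds the markup from the question with plain DOMDocument calls and then finds the paragraph again with XPath. It uses only built-in classes; the variable names are arbitrary, and the whitespace of the output will not necessarily match what the Tag class prints.

```php
<?php
$dom = new DOMDocument();

// Container: <div id="container">
$div = $dom->createElement('div');
$div->setAttribute('id', 'container');
$dom->appendChild($div);

// Anchor: <a name="anchor">
$anchor = $dom->createElement('a');
$anchor->setAttribute('name', 'anchor');
$div->appendChild($anchor);

// Paragraph: a text node is used so special characters get escaped on output.
$paragraph = $dom->createElement('p');
$paragraph->appendChild($dom->createTextNode('Lorem ipsum dolor sit amet, consectetur.'));
$div->appendChild($paragraph);

// Serialize the tree as HTML; closing tags and nesting are handled by DOM.
echo $dom->saveHTML();

// The same tree can be queried with XPath without any extra work.
$xpath = new DOMXPath($dom);
$found = $xpath->query('//div[@id="container"]/p');
foreach ($found as $node) {
    echo $node->textContent, "\n";
}
```

This covers both halves of the question in one object model: generation through createElement()/appendChild(), and lookup through DOMXPath.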

Wrikken
+1  A: 

I have not yet run into any issues parsing well-formed HTML with DomDocument. There are some issues if the HTML is not well formed (mismatched tags, a missing closing >, etc.), but with well-formed HTML it's quite easy.

$dom = new DomDocument();
$dom->loadHtml($html);

$xpath = new DomXpath($dom);
$elements = $xpath->query('//div[@id="container"]//p');
foreach ($elements as $element) {
    echo $element->textContent;
}

I find the documentation to be lacking as well. But for the most part, you can find what you need by either playing with it or looking at the DOM specification.

ircmaxell
+1  A: 

The only trouble with PHP's DOM is that it's quite picky about loading malformed HTML. It will choke and flat-out refuse to load a lot of things that most browsers will happily fly over, requiring some pre-loading hacks/cleanups to make it acceptable.

Usually not a problem, but when writing a screen scraper for a site that outputs HTML that would put Adobe Pagemill to shame, it gets a bit tedious.
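One possible way to take some of the pain out of this is to collect libxml's complaints yourself instead of letting them surface as warnings. The sketch below assumes a small, deliberately broken snippet (the $html string is invented for illustration); how well the parser recovers, and which errors it reports, depends on how bad the markup is and on the libxml2 version PHP was built against.

```php
<?php
// Deliberately malformed markup: a stray </span> and an unclosed second <p>.
$html = '<div id="container"><p>First</span></p><p>Second</div>';

// Route parser errors into an internal buffer instead of PHP warnings,
// remembering the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Grab whatever the parser complained about, then reset the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Decide for yourself what to do with the errors; here they are just printed.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// If loading succeeded, the recovered tree can be queried as usual.
$paragraphs = array();
if ($loaded) {
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//div[@id="container"]//p') as $p) {
        $paragraphs[] = $p->textContent;
    }
}
print_r($paragraphs);
```

This does not repair the markup; it only keeps the noise out of the logs and lets the scraper decide whether the recovered tree is good enough to use.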

Marc B