Update: html5lib (bottom of question) seems to get close, I just need to improve my understanding of how it's used.

I am attempting to find an HTML5-compatible DOM parser for PHP 5.3. In particular, I need to access the following HTML-like CDATA within a script tag:

<script type="text/x-jquery-tmpl" id="foo">
    <table><tr><td>${name}</td></tr></table>
</script>

Most parsers stop too early because HTML 4.01 ends a script element's content at the first ETAGO (</) it finds inside the <script> tag. HTML5, however, allows </ to appear before the closing </script>. Every parser I have tried so far has either failed or is so poorly documented that I can't tell whether it works.
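To make the difference concrete, here is a rough sketch of the two termination rules. The helpers `scriptDataHtml4()` and `scriptDataHtml5()` are hypothetical names invented for this illustration; they are not real parsers, and the HTML5 version ignores the comment-escape states of the real tokenizer.

```php
<?php
// Illustration only: where does the script content end under each rule?
// Both functions take the text that follows the opening <script ...> tag.

// HTML 4.01 rule: content ends at the first "</" followed by a letter.
function scriptDataHtml4($afterOpenTag) {
    if (preg_match('~</[a-zA-Z]~', $afterOpenTag, $m, PREG_OFFSET_CAPTURE)) {
        return substr($afterOpenTag, 0, $m[0][1]);
    }
    return $afterOpenTag;
}

// HTML5 rule (simplified): content ends only at "</script" followed by
// whitespace, "/" or ">".
function scriptDataHtml5($afterOpenTag) {
    if (preg_match('~</script[\s/>]~i', $afterOpenTag, $m, PREG_OFFSET_CAPTURE)) {
        return substr($afterOpenTag, 0, $m[0][1]);
    }
    return $afterOpenTag;
}

// Single quotes so PHP does not try to interpolate ${name}.
$input = '<table><tr><td>${name}</td></tr></table></script><p>after</p>';

$html4 = scriptDataHtml4($input); // cut off at the first </td>
$html5 = scriptDataHtml5($input); // whole template survives
```

Under the HTML 4.01 rule the template is truncated at the first `</td>`, which matches the truncated output from the failing parsers in this question.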

My requirements:

  1. Real parser, not regex hacks.
  2. Ability to load full pages or HTML fragments.
  3. Ability to pull script contents back out, selecting by the tag's id attribute.

Some failing parsers:


DOMDocument

Source:

<?php

header('Content-type: text/plain');
$d = new DOMDocument;
$d->loadHTML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();

Output:

Warning: DOMDocument::loadHTML(): Unexpected end tag : td in Entity, line: 1 in /home/adam/public_html/2010/10/26/dom.php on line 5
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script id="foo"><td>bar</script></head></html>


FluentDOM

Source:

<?php

header('Content-type: text/plain');
require_once 'FluentDOM/src/FluentDOM.php';
$html = "<html><head></head><body><script id='foo'><td></td></script></body></html>";
echo FluentDOM($html, 'text/html');

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head></head><body><script id="foo"><td></script></body></html>


phpQuery

Source:

<?php

header('Content-type: text/plain');

require_once 'phpQuery.php';

phpQuery::newDocumentHTML(<<<EOF
<script type="text/x-jquery-tmpl" id="foo">
<td>test</td>
</script>
EOF
);

echo (string)pq('#foo');

Output:

<script type="text/x-jquery-tmpl" id="foo">
<td>test
</script>


html5lib

Possibly promising. Can I get at the contents of the script#foo tag?

Source:

<?php

header('Content-type: text/plain');

include 'HTML5/Parser.php';

$html = "<!DOCTYPE html><html><head></head><body><script id='foo'><td></td></script></body></html>";
$d = HTML5_Parser::parse($html);

echo $d->saveHTML();

Output:

<html><head></head><body><script id="foo"><td></td></script></body></html>
+1  A: 
Alan Storm
Thanks for the pointers. How can I dig down to the contents of the script tag, searching by id?
Adam Backstrom
It's a standard DOMDocument object. If you're not comfortable with DOMDocument, call the saveXML method (as above) and create a SimpleXML object out of it. If you're not comfortable with SimpleXML, you should [read the manual](http://us.php.net/manual/en/simplexml.examples-basic.php).
Alan Storm
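A minimal sketch of the saveXML-to-SimpleXML route suggested in the comment above. As an assumption for the sake of a self-contained example, the DOMDocument is built by hand here; in the thread it would come from `HTML5_Parser::parse()`, and the variable names are made up.

```php
<?php
// Stand-in for the parsed document (assumption: hand-built, no namespaces).
$dom = new DOMDocument();
$body = $dom->appendChild($dom->createElement('html'))
            ->appendChild($dom->createElement('body'));
$script = $body->appendChild($dom->createElement('script'));
$script->setAttribute('id', 'foo');
// Single quotes so PHP does not try to interpolate ${name}.
$script->appendChild($dom->createTextNode('<td>${name}</td>'));

// Serialize the DOM to an XML string and re-parse it with SimpleXML.
$xml = simplexml_load_string($dom->saveXML());

// xpath() returns an array of matching SimpleXMLElement objects.
$matches = $xml->xpath('//script[@id="foo"]');

// Casting an element to string yields its text content.
$template = $matches ? (string) $matches[0] : null;
```

`simplexml_import_dom($dom)` should give the same SimpleXML object without the serialize-and-reparse step. If the parser puts elements in a namespace, the XPath expression would need a registered prefix.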
Added html5lib to [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662)
Gordon
@Alan I hit a wall (well, got mildly annoyed) when I couldn't get `$dom->getElementById()` to work on the resulting DOMDocument. I ended up working around the problem, but I'd be interested to know why it fails and if it can be made to work.
Adam Backstrom
Because DOMDocument is a confusing pile of over-engineered, poorly documented XML processing? For getElementById to work with DOM documents, you need a DTD that declares which attribute is an ID, or you must explicitly mark the attribute on an element as an ID. Whenever I have a DOMDocument, I save out an XML string to feed into SimpleXML, and then use the XPath functions to get at what I want.
Alan Storm
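The behavior described in the comment above can be reproduced with a hand-built document, sketched below. The assumption is that the parser builds its tree through plain `createElement()`/`setAttribute()` calls, so `id` is never registered as an ID; the variable names are made up for the example.

```php
<?php
// Build a small DOM by hand so that no DTD is involved and "id" is
// just an ordinary attribute.
$dom = new DOMDocument();
$body = $dom->appendChild($dom->createElement('html'))
            ->appendChild($dom->createElement('body'));
$script = $body->appendChild($dom->createElement('script'));
$script->setAttribute('id', 'foo');
$script->appendChild($dom->createTextNode('<td>${name}</td>'));

// 1. getElementById() finds nothing: "id" was never declared as an ID.
$byId = $dom->getElementById('foo');

// 2. XPath matches on the attribute value, so it needs no ID declaration.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//script[@id="foo"]');
$template = $nodes->length > 0 ? $nodes->item(0)->textContent : null;

// 3. Explicitly mark the attribute as an ID; the lookup then succeeds.
$script->setIdAttribute('id', true);
$byIdAfter = $dom->getElementById('foo');
```

The XPath query works directly on the DOMDocument, so it is probably the least invasive workaround. Calling `setIdAttribute()` requires already holding the element, which makes it more useful when walking the whole tree once to register every `id`.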
@Gordon, thanks!
Alan Storm
@Adam More info on why your call wasn't working. Sort of went beyond the 600 character limit :) http://alanstorm.com/domdocument_php_stop
Alan Storm
@Adam no problem. You might also be interested in my answer to [Simplify PHP DOM XML Parsing](http://stackoverflow.com/questions/3405117/simplify-php-dom-xml-parsing-how/3405651#3405651). Also, the id attributes in the DOM example in your blog post are not unique, so even if they were proper xml:id attributes, the XML wouldn't be valid.
Gordon