I'm a little new to PHP parsing, but I can't seem to get PHP's DOMDocument to return what is clearly an identifiable node. The HTML will come from the 'net, so I can't guarantee XML compliance. I tried the following:

<?php
header("Content-Type: text/plain");

$html = '<html><body>Hello <b id="bid">World</b>.</body></html>';

$dom = new DomDocument;
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = true;

/*** load the html into the object ***/
$dom->loadHTML($html);
var_dump($dom);    

$belement = $dom->getElementById("bid");
var_dump($belement);

?>

Though I receive no error, I only receive the following as output:

object(DOMDocument)#1 (0) {
}
NULL

Should I not be able to look up the <b> tag as it does indeed have an id?

+2  A: 

Well, you should check if $dom->loadHTML($html); returns true (success) and I would try

 var_dump($belement->nodeValue);

for output to get a clue what might be wrong.
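As a rough sketch of that check (assuming the `libxml_use_internal_errors()` / `libxml_get_errors()` functions from PHP's libxml extension, which the original post does not use), the return value and any parser complaints can be inspected like this:

```php
<?php
$html = '<html><body>Hello <b id="bid">World</b>.</body></html>';

// Collect libxml errors in a buffer instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Each entry is a LibXMLError object with message and line properties.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), " (line {$error->line})\n";
}
libxml_clear_errors();

if ($loaded === false) {
    echo "loadHTML() failed\n";
}

$belement = $dom->getElementById('bid');
// Guard against NULL before reading a property of the result.
var_dump($belement === null ? null : $belement->nodeValue);
```

The guard on the last line avoids the "Trying to get property of non-object" notice when the lookup returns NULL.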

EDIT: http://www.php-editors.com/php_manual/function.domdocument-get-element-by-id.html - it seems that DomDocument uses XPath internally.

Example:

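// xpath_new_context()/xpath_eval_expression() belong to the PHP 4 DOM XML extension;
// with PHP 5's DOM extension the equivalent uses DOMXPath (note the lowercase @id).
$xpath = new DOMXPath($dom);
var_dump($xpath->query("//*[@id='YOURIDGOESHERE']")->item(0));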
MartyIX
Original post modified to reflect these outputs. Thanks,
Xepoch
+3  A: 

The Manual explains why:

For this function to work, you will need either to set some ID attributes with DOMElement->setIdAttribute() or a DTD which defines an attribute to be of type ID. In the later case, you will need to validate your document with DOMDocument->validate() or DOMDocument->validateOnParse before using this function.

By all means, go for valid HTML & provide a DTD.
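The `setIdAttribute()` route from the manual quote could look roughly like the sketch below. It is only an illustration: it uses `loadXML()` on a small DTD-less snippet (an assumption made here so the before/after difference shows up regardless of how a given libxml build treats HTML `id` attributes), and the variable names are made up for the example.

```php
<?php
// No DTD here, so a plain "id" attribute is not of type ID yet.
$dom = new DOMDocument();
$dom->loadXML('<root>Hello <b id="bid">World</b>.</root>');

// Lookup before any attribute has been registered as an ID.
$before = $dom->getElementById('bid');

// Mark every "id" attribute as an ID so getElementById() can see it.
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//*[@id]') as $element) {
    $element->setIdAttribute('id', true);
}

// Same lookup after registration.
$after = $dom->getElementById('bid');

var_dump($before);
echo $after->nodeValue, "\n";
```

Since XPath is already needed to find the elements, this mostly pays off when many `getElementById()` lookups follow.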

Quick fixes:

  1. Call $dom->validate(); and put up with the errors (or fix them). Afterwards you can use $dom->getElementById(), which for some reason works regardless of the errors.
  2. Use XPath if you don't feel like validating: $x = new DOMXPath($dom); $el = $x->query("//*[@id='bid']")->item(0);
  3. Come to think of it: if you just set validateOnParse to true before loading the HTML, it would also work ;P


$dom = new DOMDocument();
$html ='<html>
<body>Hello <b id="bid">World</b>.</body>
</html>';
$dom->validateOnParse = true; // set this first, before loading
$dom->loadHTML($html);        // because loading is when the parsing happens

$dom->preserveWhiteSpace = false;

$belement = $dom->getElementById("bid");
echo $belement->nodeValue;

Outputs 'World' here.
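Because the comments on this answer show that `getElementById()` does not behave the same on every PHP/libxml combination, the two approaches can be combined. The following is a minimal sketch, assuming PHP 7.1 or later for the type declarations; `findById` is a hypothetical helper name, not part of the DOM extension.

```php
<?php
// Hypothetical helper: try getElementById() first, then fall back to XPath.
function findById(DOMDocument $dom, string $id): ?DOMElement
{
    $el = $dom->getElementById($id);
    if ($el instanceof DOMElement) {
        return $el;
    }

    // XPath 1.0 has no escaping inside string literals, so pick the quote
    // character the id does not contain; give up if it contains both.
    if (strpos($id, "'") === false) {
        $literal = "'" . $id . "'";
    } elseif (strpos($id, '"') === false) {
        $literal = '"' . $id . '"';
    } else {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $node = $xpath->query('//*[@id=' . $literal . ']')->item(0);
    return $node instanceof DOMElement ? $node : null;
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body>Hello <b id="bid">World</b>.</body></html>');

$found = findById($dom, 'bid');
echo $found === null ? "not found\n" : $found->nodeValue . "\n";
```

Either path returns the same `<b>` element for this input, so the caller does not need to know which one succeeded.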

Wrikken
I do have validateOnParse. setIdAttribute only would apply to set and then subsequent retrieve? Again though, the HTML will be web-provided so I'm at their mercy, but just trying an example. HTML5 doesn't even have a DTD, yes?
Xepoch
"setIdAttribute only would apply to set and then subsequent retrieve?" -> Yes. HTML5 is not finished yet so it should not have a DTD yet.
MartyIX
DTD would be `<!DOCTYPE HTML>`, but just calling `$dom->validate()` would also work. Put up with the errors or try to generate valid HTML (the latter is more difficult than it seems... :) )
Wrikken
@Xepoch I've never managed to get `getElementById` working when using `DOM` with HTML. But you can substitute `getElementById` with an XPath like `//p[@id="foo"]`
Gordon
@Wrikken doesn't work for me. I'm getting *Trying to get property of non-object* on the `echo` call with PHP 5.3.2 on Vista and libxml 20703
Gordon
Hmm, it does work here with PHP 5.3.2 and libxml 2.7.6 (I assume that 20703 on Windows would be 2.7.3). You could try ftp://ftp.zlatkovic.com/libxml/libxml2-2.7.6.win32.zip . Does calling `validate()` manually later on also give no results?
Wrikken
... and if that doesn't work, have you tried using the example from http://www.php.net/manual/en/domimplementation.createdocument.php ?
Wrikken
@Wrikken Doing `validate()` only gets me a couple of errors about the `html40/loose.dtd` and the same error as before. Using the explicit DTD declaration doesn't help either. I've tried on an XP machine with 5.3.0 and libxml 20626, and nothing there either. I guess this is either a Windows thing or a libxml thing. I'll try to update it. Upvoted nonetheless though.
Gordon
@Gordon: OK, duly noted that this behavior isn't consistent across OSes/versions. Thankfully it works on my servers :) The XPath stays a failsafe fallback afaik.
Wrikken
@Wrikken after upgrading PHP to 5.3.3 which comes bundled with libxml 2.7.7, getElementById is working.
Gordon
OK, good news, nice to know life just got that little bit easier :)
Wrikken
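
To compare a local setup against the versions mentioned in the comments above, the PHP and libxml versions can be printed with PHP's predefined constants. This is only a sketch: the 20707 threshold is taken from the one report in this thread (libxml 2.7.7 working), not from any documented cutoff, and another commenter had it working on 2.7.6.

```php
<?php
// LIBXML_VERSION is an integer such as 20703;
// LIBXML_DOTTED_VERSION is the same version as a string such as "2.7.3".
printf("PHP %s, libxml %s (%d)\n", PHP_VERSION, LIBXML_DOTTED_VERSION, LIBXML_VERSION);

// Assumed threshold from the thread: 20707 corresponds to libxml 2.7.7.
$olderLibxml = LIBXML_VERSION < 20707;
if ($olderLibxml) {
    echo "Older libxml: consider the XPath lookup as a fallback.\n";
}
```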