Why does the following not work?

$dom = new DOMDocument();
@$dom->load('http://tinyurl.com/35cs96n');
$xpath = new DOMXPath($dom);

$entries = $xpath->query('//table[@id="SubCategory_SubCategoryDataList"]/a/@href');

foreach ($entries as $entry) {
    echo $entry->nodeValue.'<br>';
}
+3  A: 

Isn't it supposed to be //table[@id="SubCategory_SubCategoryDataList"]//a/@href ?

(Notice the two slashes before the a, since you're not looking at direct children)
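
To see the difference between the two forms, here is a minimal sketch. The HTML fragment and its hrefs are made up for illustration and only assume that the links sit inside tr/td cells, as they appear to on the real page:

```php
<?php
// Hypothetical markup standing in for the real page:
// the links live inside tr/td, not directly under the table.
$html = '<table id="SubCategory_SubCategoryDataList">'
      . '<tr><td><a href="/one">One</a></td><td><a href="/two">Two</a></td></tr>'
      . '</table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$base = '//table[@id="SubCategory_SubCategoryDataList"]';

// "/a" matches only a elements that are direct children of the table.
$direct = $xpath->query($base . '/a/@href');

// "//a" matches a elements at any depth below the table.
$nested = $xpath->query($base . '//a/@href');

echo $direct->length, "\n"; // 0
echo $nested->length, "\n"; // 2

foreach ($nested as $attr) {
    echo $attr->nodeValue, "\n";
}
```

The single-slash query returns an empty node list rather than an error, which is why the original loop printed nothing.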

slhck
+3  A: 

If your code contains an error suppression operator (@), the first thing to do is remove it and see whether it actually suppressed any errors. In your case, it did. A lot. So many, in fact, that DOM couldn't load the content (at least it wouldn't show any when I tried to output the file with saveXML()). The correct way to load broken HTML with DOM is to use:

libxml_use_internal_errors(TRUE);
$dom = new DOMDocument();
$dom->loadHTMLFile('http://tinyurl.com/35cs96n');
libxml_clear_errors();

Loading the page with loadHTMLFile makes DOM use the HTMLParser module, which is much more forgiving about broken markup. The libxml function calls keep the errors away from you.
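
If you want to inspect what was suppressed instead of just discarding it, something along these lines should work. This is a sketch on a made-up broken fragment, not the real page; which messages libxml reports, if any, depends on the libxml version installed:

```php
<?php
// Hypothetical broken markup: the a element is never closed
// and <bogus> is not an HTML tag.
$html = '<table id="t"><tr><td><a href="/x">X</td></tr><bogus></table>';

// Collect parser errors internally instead of emitting PHP warnings,
// remembering the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

$errors = libxml_get_errors(); // array of LibXMLError objects
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Print whatever the parser complained about.
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The document is still usable despite the broken markup.
$linkCount = $dom->getElementsByTagName('a')->length;
echo $linkCount, "\n";
```

Unlike @, this keeps the warnings available for debugging while still letting the document load.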

As for the XPath, try @slhck's suggestion. The a elements are not direct children of the table; there are tr and td elements in between. If you look at the HTML, you will see that the a elements all have ids derived from the table's id, so you could also query them directly with

 '//a[contains(@id, "SubCategory_SubCategoryDataList")]/@href'
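
Put together, that query could be used roughly as follows. The fragment below is a stand-in for the real page, and the id suffix "_ctl00_link" is invented for the example; only the shared table-id prefix is taken from the page:

```php
<?php
// Hypothetical fragment: one link whose id starts with the table id,
// and one unrelated link that should not match.
$html = '<table id="SubCategory_SubCategoryDataList"><tr>'
      . '<td><a id="SubCategory_SubCategoryDataList_ctl00_link" href="/a">A</a></td>'
      . '<td><a id="other" href="/b">B</a></td>'
      . '</tr></table>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select the href attribute of every a element whose id contains the table id.
$query = '//a[contains(@id, "SubCategory_SubCategoryDataList")]/@href';

$hrefs = [];
foreach ($xpath->query($query) as $attr) {
    $hrefs[] = $attr->nodeValue;
}

echo implode("\n", $hrefs), "\n"; // prints "/a"
```

For the live page you would swap loadHTML($html) for loadHTMLFile() with the URL, as shown above.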
Gordon
Nice addition, thanks!
slhck