I'm trying to extract all relevant URLs and images from a page and put them into an array. The code below works fine, except that it outputs the first pair over and over, repeated the correct number of times (once per matching row). I thought I might be making mistakes when specifying the XPath expressions, but I've tested it on 3 different sites with the same result every time.

$dom = new DOMDocument();
$dom->loadHtml( $html );
$xpath = new DOMXPath( $dom );

$items = $xpath->query( "//div[@class=\"row\"]" );

foreach ( $items as $item ) {
    $value['url'] = $xpath->query( "//div[@class=\"productImg\"]/a/@href", $item )->item(0)->nodeValue;
    $value['img'] = $xpath->query( "//div[@class=\"productImg\"]/a/img/@src", $item )->item(0)->nodeValue;

    $result[] = $value;
}

print_r($result);

Clearly the code isn't right, but I haven't been able to narrow it down to the offending portion. And before somebody suggests using regex: that is something I'd usually do, but I'd prefer to use XPath now if possible.

A: 

I'd have to make too many assumptions about what your HTML looks like, but one problem I can spot right off the bat is the ->item(0) portion. That 0 needs to reflect the iteration in question.

Assuming that $items will always have numerical keys:

foreach( $items as $key => $item ) {
 ..... item)->item($key)->nodeValue;
}
pp19dd
A: 

Given query("//div[@class=\"productImg\"]/a/img/@src",$item), it looks like you want to perform a query relative to $item. You're very nearly there, just not quite.

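Your query starts with //div, which means to look for any <div> nodes that are descendants of the document root and satisfy the remaining portion of the query. The key place where you're falling over is that this expression is evaluated from the document root even though you pass $item as the context node, so every iteration matches the same full set of nodes and item(0) is always the first one in the document.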

To make the query relative to the context node, start the expression with . so that .//div matches any <div> nodes that are descendants of the context node (i.e. your $item).
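
As a minimal sketch of that change, here is the loop from the question with both inner queries made relative. The sample markup in $html is made up purely for illustration (your real page will differ), and the null check is an extra precaution that is not in the original code.

```php
<?php
// Hypothetical sample markup, invented for this example only.
$html = <<<HTML
<div class="row"><div class="productImg"><a href="/p/1"><img src="/img/1.jpg"></a></div></div>
<div class="row"><div class="productImg"><a href="/p/2"><img src="/img/2.jpg"></a></div></div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$result = [];
foreach ($xpath->query('//div[@class="row"]') as $item) {
    // The leading "." anchors each query at $item instead of the document root.
    $href = $xpath->query('.//div[@class="productImg"]/a/@href', $item)->item(0);
    $src  = $xpath->query('.//div[@class="productImg"]/a/img/@src', $item)->item(0);

    // item(0) returns null when nothing matches, so skip incomplete rows.
    if ($href === null || $src === null) {
        continue;
    }

    $result[] = ['url' => $href->nodeValue, 'img' => $src->nodeValue];
}

print_r($result);
```

With the relative queries, item(0) is the first match inside the current row, so each iteration picks up its own pair rather than the first pair in the document.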

salathe
You are correct, thanks!
Clarissa