I am working with PHP and XPath, trying to quickly pull certain links out of an HTML page.

The following will find all href links on mypage.html: $nodes = $x->query("//a[@href]");

Whereas the following will find all href links where the description matches my needle: $nodes = $x->query("//a[contains(@href,'click me')]");

What I am trying to achieve is matching on the href itself, more specifically finding URLs that contain certain parameters. Is that possible within an XPath query, or should I just start manipulating the output from the first XPath query?

+3  A: 

Not sure I understand the question correctly, but the second XPath expression already does what you are describing. It does not match against the text node of the A element, but the href attribute:

$html = <<< HTML
<ul>
    <li>
        <a href="http://example.com/page?foo=bar">Description</a>
    </li>
    <li>
        <a href="http://example.com/page?lang=de">Description</a>
    </li>
</ul>
HTML;

$xml  = simplexml_load_string($html);
// Select every A element whose href attribute contains the substring "foo"
$list = $xml->xpath("//a[contains(@href,'foo')]");
var_dump($list);

Outputs:

array(1) {
  [0]=>
  object(SimpleXMLElement)#2 (2) {
    ["@attributes"]=>
    array(1) {
      ["href"]=>
      string(31) "http://example.com/page?foo=bar"
    }
    [0]=>
    string(11) "Description"
  }
}

As you can see, the returned array contains only the A element whose href contains foo (which I understand is what you are looking for). It contains the entire element, because the XPath translates to "fetch all A elements with an href attribute containing foo". You would then access the attribute with

echo $list[0]['href']; // gives "http://example.com/page?foo=bar"

If you only want to return the attribute itself, you'd have to do

//a[contains(@href,'foo')]/@href

Note that in SimpleXml, this would return a SimpleXml element though:

array(1) {
  [0]=>
  object(SimpleXMLElement)#3 (1) {
    ["@attributes"]=>
    array(1) {
      ["href"]=>
      string(31) "http://example.com/page?foo=bar"
    }
  }
}

but you can output the URL now by

echo $list[0]; // gives "http://example.com/page?foo=bar"
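
For comparison, here is a rough sketch of the same attribute query using DOMDocument and DOMXPath, which is the $x->query() style used in the question. The variable names $dom, $hrefs and $strict are only illustrative. The second expression is an assumption about what "contain certain parameters" might mean: it matches foo only as a query-string parameter name, so a URL such as /food would not match.

```php
<?php
// Same sample markup as in the answer above.
$html = <<<HTML
<ul>
    <li><a href="http://example.com/page?foo=bar">Description</a></li>
    <li><a href="http://example.com/page?lang=de">Description</a></li>
</ul>
HTML;

$dom = new DOMDocument();
$dom->loadXML($html);      // the sample is well-formed, so loadXML() is enough here
$x = new DOMXPath($dom);   // $x mirrors the variable name from the question

// Select the href attribute nodes directly and collect their string values.
$hrefs = [];
foreach ($x->query("//a[contains(@href,'foo')]/@href") as $attr) {
    $hrefs[] = $attr->nodeValue;
}

// Narrower variant (assumption): require foo to appear as a parameter name.
$strict = $x->query("//a[contains(@href,'?foo=') or contains(@href,'&foo=')]");

print_r($hrefs);
```

With DOMXPath each result is a DOMAttr node, so nodeValue yields the URL string without any casting.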
Gordon
That is what I meant. Only my HTML document fails when using SimpleXML. The XPath query works though, and using it with DOMXPath gives me what I want. Thanks!
Matt
@Matt SimpleXml and DOM choke on malformed HTML. Consider either Tidy or HTMLPurifier to repair the HTML before parsing it or try to use SimpleHTML. The latter doesn't support XPath though.
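
One more option that may be worth trying before reaching for a repair step: DOMDocument::loadHTML() goes through libxml's HTML parser rather than the XML parser, and it is usually more forgiving of unclosed tags. The snippet below is only a sketch; the malformed markup is made up for illustration, and silencing warnings with libxml_use_internal_errors() is one possible choice, not a requirement.

```php
<?php
// Deliberately sloppy sample (made up): unclosed <li> and <p> tags.
$html = '<ul><li><a href="http://example.com/page?foo=bar">One</a>'
      . '<li><a href="http://example.com/page?lang=de">Two</a></ul><p>trailing';

$previous = libxml_use_internal_errors(true); // collect parse warnings instead of emitting them
$dom = new DOMDocument();
$dom->loadHTML($html);                        // HTML parser, not the strict XML parser
libxml_clear_errors();
libxml_use_internal_errors($previous);        // restore the previous setting

$x = new DOMXPath($dom);
$links = [];
foreach ($x->query("//a[contains(@href,'foo')]") as $a) {
    $links[] = $a->getAttribute('href');
}
print_r($links);
```

How well this copes depends on how badly broken the real page is, so Tidy or HTMLPurifier may still be needed for severe cases.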
Gordon