Hey guys, I need help with a regex.

I'm using file_get_contents() to get the source of a page. I then want to loop through the source, find all the <a> tags, and extract all the href values into an array.

Thanks

+1  A: 

You'd be better off using a real parser like SimpleXML or DOMDocument rather than regular expressions. Here’s an example with DOMDocument that loops over all the A elements and checks each one for an href attribute:

$doc = new DOMDocument();
$doc->loadHTML($str);
$aElements = $doc->getElementsByTagName("a");
foreach ($aElements as $aElement) {
    if ($aElement->hasAttribute("href")) {
        // link; use $aElement->getAttribute("href") to retrieve the value
    } else {
        // not a link
    }
}
Gumbo
Should I assume that $str is the value returned by file_get_contents()?
dotty
@dotty: Yes, `$str` is the string with the HTML source code.
Gumbo
Well, I used your code, but it threw up a load of errors about unformatted tags and such. So I did some digging and found a regex: `preg_match_all("/href=\"(.*?)\"/", $html, $aElements);` How would I use this to find only http sources?
dotty
I wouldn’t use regular expressions, because HTML is not a regular language. By the way, did you try disabling *strictErrorChecking* (see http://docs.php.net/manual/en/class.domdocument.php#domdocument.props.stricterrorchecking)?
Gumbo
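
For the follow-up about keeping only http links, here is a minimal sketch that stays with DOMDocument instead of the regex. It is an illustration, not code from the thread: the function name `extractHttpLinks` and the sample `$html` string are made up, and it uses `libxml_use_internal_errors(true)` as one common way to stop malformed markup from producing a flood of warnings. It assumes "http sources" means href values beginning with http:// or https://.

```php
<?php
// Hypothetical helper (name is an assumption, not from the thread):
// returns the href values of all <a> elements that start with http:// or https://.
function extractHttpLinks(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them,
    // since real-world HTML is often not well formed.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        // getAttribute() returns an empty string when href is missing,
        // which simply fails the pattern check below.
        $href = $a->getAttribute('href');
        if (preg_match('#^https?://#i', $href) === 1) {
            $links[] = $href;
        }
    }
    return $links;
}

// Made-up sample input; in the question this would be the string
// returned by file_get_contents().
$html = '<p><a href="http://example.com/">one</a> <a href="/local">two</a> '
      . '<a name="x">three</a><b>unclosed';
$links = extractHttpLinks($html);
print_r($links);
```

The pattern check runs on the already-extracted attribute value, so the regex only has to recognise a URL prefix and never has to parse HTML itself. Relative links such as /local are dropped here; resolving them against the page URL would be a separate step.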