views: 385
answers: 2

I'd like to grab a few hundred URLs from a few hundred HTML pages.

Pattern:

<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>
+1  A: 
'/http:\/\/[^\/]+\/[^.]+\.asp\?urlid=\d+/'
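A minimal sketch of running that pattern with preg_match_all; the sample string is taken from the question's own markup:

```php
<?php
// Sample markup matching the question's pattern (from the question itself)
$html = '<h2><a href="http://www.the.url.might.be.long/urls.asp?urlid=1" target="_blank">The Website</a></h2>';

// Note the escaped inner slash (\/); left unescaped, it would
// terminate the pattern early, since / is also the delimiter
preg_match_all('/http:\/\/[^\/]+\/[^.]+\.asp\?urlid=\d+/', $html, $matches);

print_r($matches[0]);
```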

But it's better to use an HTML parser. Here is an example with PHP Simple HTML DOM:

// Requires simple_html_dom.php from the PHP Simple HTML DOM Parser library
include 'simple_html_dom.php';

$html = file_get_html('http://www.google.com/');

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}
S.Mark
+3  A: 

Here is how to do it properly with the native DOM extension:

// Load the remote document (suppress warnings from malformed real-world HTML)
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTMLFile('http://example.com/');

// Run an XPath query to fetch all href attributes of <a> elements
$xpath = new DOMXPath($doc);
$links = $xpath->query('//a/@href');

// Collect the href values from the DOMAttr nodes into an array
$urls = array();
foreach($links as $link) {
    $urls[] = $link->value;
}
print_r($urls);

Note that the above will also find relative links. If you don't want those, adjust the XPath to

'//a/@href[starts-with(., "http")]'
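For instance, parsed from an inline string rather than a URL (the sample markup with one absolute and one relative link is assumed), the filtered query returns only the absolute link:

```php
<?php
// Sample document with one absolute and one relative link (assumed for illustration)
$doc = new DOMDocument;
$doc->loadHTML('<p><a href="http://example.com/a.asp?urlid=1">abs</a><a href="/relative">rel</a></p>');

$xpath = new DOMXPath($doc);
$urls = array();
// Keep only href attributes whose value starts with "http"
foreach ($xpath->query('//a/@href[starts-with(., "http")]') as $attr) {
    $urls[] = $attr->value;
}
print_r($urls);
```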

Note that using Regex to match HTML is the road to madness. Regex matches string patterns and knows nothing about HTML elements and attributes. DOM does, which is why you should prefer it over Regex for anything beyond matching a truly trivial string pattern in markup.
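Since the question mentions a few hundred pages, the same DOM approach can be wrapped in a loop. A sketch using inline strings in place of real fetches; in practice each entry would be a loadHTMLFile() call per page URL, and the sample markup here is assumed:

```php
<?php
// Placeholder pages standing in for real fetched documents (assumed)
$pages = array(
    '<h2><a href="http://site.example/urls.asp?urlid=1">One</a></h2>',
    '<h2><a href="http://site.example/urls.asp?urlid=2">Two</a></h2>',
);

$urls = array();
foreach ($pages as $html) {
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a/@href') as $attr) {
        $urls[] = $attr->value;
    }
}
print_r($urls);
```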

Gordon