I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file.
Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.
In its simplest form:
function crawl_page($url, $depth = 5) {
    if ($depth > 0) {
        $html = file_get_contents($url);

        // Extract the href of every anchor tag with a regex
        preg_match_all('~<a.*?href="(.*?)".*?>~', $html, $matches);

        // Recurse into every link found, with one less level of depth remaining
        foreach ($matches[1] as $newurl) {
            crawl_page($newurl, $depth - 1);
        }

        // Append this page's URL and contents to the results file
        file_put_contents('results.txt', $url."\n\n".$html."\n\n", FILE_APPEND);
    }
}
crawl_page('http://www.domain.com/index.php', 5);
That function fetches the contents of a page, crawls all the links it finds, and appends the contents to 'results.txt'. The function accepts a second parameter, depth, which defines how many levels of links should be followed. Pass 1 there if you only want to parse the links on the given page.
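One gotcha worth flagging: that regex happily picks up relative and off-site URLs, so file_get_contents can fail or wander away from your site. A rough same-host filter you could swap in for the foreach loop above, just a sketch (the parse_url check is my addition, not part of the original function):

// Sketch: only recurse into absolute links that point at the same host as $url
$host = parse_url($url, PHP_URL_HOST);
foreach ($matches[1] as $newurl) {
    if (parse_url($newurl, PHP_URL_HOST) === $host) {
        crawl_page($newurl, $depth - 1);
    }
}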
Why use PHP for this, when you can use wget, e.g.
wget -r -l 1 http://www.example.com
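If you need to kick it off from PHP anyway, something along these lines should do, assuming the wget binary is installed and on the PATH:

<?php
// Sketch: let wget mirror one level of links; escapeshellarg guards the URL
$url = 'http://www.example.com';
shell_exec('wget -r -l 1 ' . escapeshellarg($url));

wget drops the pages into a directory named after the host, which you can then read back with PHP if you still need everything in a single file.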
As mentioned, there are crawler frameworks out there ready to be customized, but if what you're doing is as simple as you describe, you can build it from scratch pretty easily.
Scraping the links: http://www.phpro.org/examples/Get-Links-With-DOM.html
Dumping results to a file: http://www.tizag.com/phpT/filewrite.php
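Stitched together, it could look roughly like this (untested sketch; links.txt is just an example filename):

<?php
// Sketch: collect hrefs with DOMDocument and dump them to a local file
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.example.com');

$links = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

file_put_contents('links.txt', implode(PHP_EOL, $links));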
Meh. Don't parse HTML with regexes.
Here's a DOM version inspired by Tatu's:
<?php
function crawl_page($url, $depth = 5)
{
    // Remember which URLs we have already visited so we never crawl one twice
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        // Turn relative links into absolute ones before recursing
        if (0 !== strpos($href, 'http')) {
            $href = rtrim($url, '/') . '/' . ltrim($href, '/');
        }
        crawl_page($href, $depth - 1);
    }

    echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
}
crawl_page("http://hobodave.com", 2);
Edit: I fixed some bugs from Tatu's version (works with relative URLs now).
Edit: I added a new bit of functionality that prevents it from following the same URL twice.
Edit: echoing output to STDOUT now, so you can redirect it to whatever file you want.
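If you would rather write straight to a file than redirect STDOUT, you could replace the echo with something like this (results.txt is only an example name, not part of the original answer):

// Sketch: append each crawled page to a local file instead of echoing it
file_put_contents(
    'results.txt',
    'URL: ' . $url . PHP_EOL . 'CONTENT:' . PHP_EOL . $dom->saveHTML() . PHP_EOL . PHP_EOL,
    FILE_APPEND
);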
Try this PHP code:
<?php
$original_file = file_get_contents("http://www.domain.com");
$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

//DEBUGGING
//$matches[0] now contains the complete A tags; ex: <a href="link">text</a>
//$matches[1] now contains only the HREFs in the A tags; ex: link
header("Content-type: text/plain"); //Set the content type to plain text so the print below is easy to read!
print_r($matches); //View the array to see if it worked
?>
When actually using it, remove everything after the //DEBUGGING comment; it's only there so you can drop the script into a PHP file by itself and see how it works while testing.
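To get from there to the local file the question asks about, you could append each extracted href, roughly like this (links.txt is just an example filename):

// Sketch: write every href found above to links.txt, one per line
// (assumes $matches[1] holds the hrefs, as in the debugging comments)
foreach ($matches[1] as $href) {
    file_put_contents('links.txt', $href . PHP_EOL, FILE_APPEND);
}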