views: 734
answers: 6

I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file.

Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.

+2  A: 

Check out PHP Crawler

http://sourceforge.net/projects/php-crawler/

See if it helps.

GeekTantra
+1  A: 

In its simplest form:

function crawl_page($url, $depth = 5) {
    if($depth > 0) {
        $html = file_get_contents($url);

        preg_match_all('~<a.*?href="(.*?)".*?>~', $html, $matches);

        foreach($matches[1] as $newurl) {
            crawl_page($newurl, $depth - 1);
        }

        file_put_contents('results.txt', $newurl."\n\n".$html."\n\n", FILE_APPEND);
    }
}

crawl_page('http://www.domain.com/index.php', 5);

That function fetches the contents of a page, crawls all the links it finds, and appends the contents to 'results.txt'. The function accepts a second parameter, $depth, which defines how many levels of links should be followed. Pass 1 there if you want to parse only the links on the given page.

Tatu Ulmanen
Hey, Thanks :-)
Crimson
-1: Meh to using regexes. Doesn't work with relative urls. Also uses the wrong URL in the file_put_contents().
hobodave
@hobodave, all true.
Tatu Ulmanen
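(For readers hitting the relative-URL problem mentioned above: one way to handle it is to resolve each href against the base URL before recursing. A minimal sketch follows; `resolve_url` is a made-up helper name, and it only covers the common cases, not `../` segments or query/fragment edge cases.)

```php
<?php
// Hypothetical helper, not from the answers above: resolve an href against
// the URL of the page it was found on. Handles absolute URLs,
// root-relative paths (/about.html) and page-relative paths (about.html).
function resolve_url($base, $href) {
    if (preg_match('~^https?://~i', $href)) {
        return $href;                          // already absolute
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (strpos($href, '/') === 0) {
        return $origin . $href;                // root-relative
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;             // page-relative
}

echo resolve_url('http://example.com/docs/index.html', 'page2.html');
// http://example.com/docs/page2.html
```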
+2  A: 

Why use PHP for this, when you can use wget? For example:

wget -r -l 1 http://www.example.com

(-r crawls recursively, -l 1 limits the recursion to one level.)
Gordon
Some specific fields have to be parsed and taken out. I will need to write code.
Crimson
@Crimson that's a requirement you should note in the question then ;)
Gordon
@Gordon: "How do I make a simple crawler in PHP?" :-P
hobodave
@hobodave I meant the part about *having to parse and take out specific fields* :P If it wasn't for this, using wget is the simplest thing I could imagine for this purpose.
Gordon
+1  A: 

As mentioned, there are crawler frameworks ready for customizing out there, but if what you're doing is as simple as you mentioned, you could make it from scratch pretty easily.

Scraping the links: http://www.phpro.org/examples/Get-Links-With-DOM.html

Dumping results to a file: http://www.tizag.com/phpT/filewrite.php
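Putting those two pieces together, a bare-bones sketch might look like this (`get_links` is a name I made up; the inline HTML stands in for a fetched page):

```php
<?php
// Extract hrefs with DOM, then dump them to a file.
function get_links($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);          // @ silences warnings about sloppy markup
    $links = array();
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// In real use you would feed it a fetched page, e.g. file_get_contents($url).
$html = '<a href="a.html">A</a> <a href="b.html">B</a>';
file_put_contents('links.txt', implode("\n", get_links($html)) . "\n", FILE_APPEND);
```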

Jens Roland
+3  A: 

Meh. Don't parse HTML with regexes.

Here's a DOM version inspired by Tatu's:

<?php
function crawl_page($url, $depth = 5)
{
    static $seen = array();
    // don't revisit a URL, and stop when the depth limit is reached
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);    // @ suppresses warnings from malformed HTML

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            // naive relative-to-absolute conversion
            $href = rtrim($url, '/') . '/' . ltrim($href, '/');
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:",$url,PHP_EOL,"CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL;
}
crawl_page("http://hobodave.com", 2);
crawl_page("http://hobodave.com", 2);

Edit: I fixed some bugs from Tatu's version (works with relative URLs now).

Edit: I added a new bit of functionality that prevents it from following the same URL twice.

Edit: echoing output to STDOUT now so you can redirect it to whatever file you want

hobodave
Thanks for that advice. Very Helpful :)
Crimson
Can I recommend using curl to fetch the page, then manipulating/traversing it with the DOM library? If you're doing this frequently, curl is a much better option imo.
Ben Shelock
@Ben: Why is it better?
hobodave
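(To make the curl suggestion concrete: unlike a bare file_get_contents, curl lets you set timeouts, follow redirects, and send a user agent. A sketch, with `fetch_page` as a hypothetical helper name:)

```php
<?php
// Hypothetical fetch_page helper using curl instead of file_get_contents.
function fetch_page($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow 3xx redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);     // give up on slow hosts
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0');
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

// Then hand the result to DOM as in the answer above:
// $dom = new DOMDocument();
// @$dom->loadHTML(fetch_page('http://www.example.com'));
```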
A: 

Try this PHP code:

<?php
$original_file = file_get_contents("http://www.domain.com");
$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

//DEBUGGING
//$matches[0] now contains the complete A tags; ex: <a href="link">text</a>
//$matches[1] now contains only the HREFs in the A tags; ex: link
header("Content-type: text/plain"); //Set the content type to plain text so the print below is easy to read!
print_r($matches); //View the array to see if it worked
?>

When actually using it, you would remove everything after //DEBUGGING; it's just there so you can see how it works if you put it in a PHP file by itself for testing.

This adds nothing that Tatu hasn't said, but is worse.
hobodave