I have a web page with a bunch of links. I want to write a script which would dump all the data contained in those links in a local file.
Has anybody done that with PHP? General guidelines and gotchas would suffice as an answer.
In its simplest form:
function crawl_page($url, $depth = 5) {
    if ($depth > 0) {
        $html = file_get_contents($url);

        // Extract the href of every anchor tag with a regex
        preg_match_all('~<a.*?href="(.*?)".*?>~', $html, $matches);

        // Recurse into every link found, with one less level of depth remaining
        foreach ($matches[1] as $newurl) {
            crawl_page($newurl, $depth - 1);
        }

        // Append this page's URL and contents to the results file
        file_put_contents('results.txt', $url."\n\n".$html."\n\n", FILE_APPEND);
    }
}
crawl_page('http://www.domain.com/index.php', 5);
That function fetches the contents of a page, crawls all the links it finds, and appends the contents to 'results.txt'. The function accepts a second parameter, depth, which defines how many levels of links should be followed. Pass 1 there if you only want to parse the links on the given page.
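One gotcha worth flagging: that regex happily picks up relative and off-site URLs, so file_get_contents can fail or wander away from your site. A rough same-host filter you could swap in for the foreach loop above, just a sketch (the parse_url check is my addition, not part of the original function):

// Sketch: only recurse into absolute links that point at the same host as $url
$host = parse_url($url, PHP_URL_HOST);
foreach ($matches[1] as $newurl) {
    if (parse_url($newurl, PHP_URL_HOST) === $host) {
        crawl_page($newurl, $depth - 1);
    }
}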
Why use PHP for this, when you can use wget, e.g.
wget -r -l 1 http://www.example.com
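If you need to kick it off from PHP anyway, something along these lines should do, assuming the wget binary is installed and on the PATH:

<?php
// Sketch: let wget mirror one level of links; escapeshellarg guards the URL
$url = 'http://www.example.com';
shell_exec('wget -r -l 1 ' . escapeshellarg($url));

wget drops the pages into a directory named after the host, which you can then read back with PHP if you still need everything in a single file.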
As mentioned, there are crawler frameworks out there ready to be customized, but if what you're doing is as simple as you describe, you can build it from scratch pretty easily.
Scraping the links: http://www.phpro.org/examples/Get-Links-With-DOM.html
Dumping results to a file: http://www.tizag.com/phpT/filewrite.php
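Stitched together, it could look roughly like this (untested sketch; links.txt is just an example filename):

<?php
// Sketch: collect hrefs with DOMDocument and dump them to a local file
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.example.com');

$links = array();
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

file_put_contents('links.txt', implode(PHP_EOL, $links));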
Meh. Don't parse HTML with regexes.
Here's a DOM version inspired by Tatu's:
<?php
function crawl_page($url, $depth = 5)
{
    // Remember which URLs we have already visited so we never crawl one twice
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        // Turn relative links into absolute ones before recursing
        if (0 !== strpos($href, 'http')) {
            $href = rtrim($url, '/') . '/' . ltrim($href, '/');
        }
        crawl_page($href, $depth - 1);
    }

    echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
}
crawl_page("http://hobodave.com", 2);
Edit: I fixed some bugs from Tatu's version (works with relative URLs now).
Edit: I added a new bit of functionality that prevents it from following the same URL twice.
Edit: echoing output to STDOUT now, so you can redirect it to whatever file you want.
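If you would rather write straight to a file than redirect STDOUT, you could replace the echo with something like this (results.txt is only an example name, not part of the original answer):

// Sketch: append each crawled page to a local file instead of echoing it
file_put_contents(
    'results.txt',
    'URL: ' . $url . PHP_EOL . 'CONTENT:' . PHP_EOL . $dom->saveHTML() . PHP_EOL . PHP_EOL,
    FILE_APPEND
);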
Try this PHP code:
<?php
$original_file = file_get_contents("http://www.domain.com");
$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);

//DEBUGGING
//$matches[0] now contains the complete A tags; ex: <a href="link">text</a>
//$matches[1] now contains only the HREFs in the A tags; ex: link
header("Content-type: text/plain"); //Set the content type to plain text so the print below is easy to read!
print_r($matches); //View the array to see if it worked
?>
When actually using it, remove everything after the //DEBUGGING comment; it's only there so you can drop the script into a PHP file by itself and see how it works while testing.
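To get from there to the local file the question asks about, you could append each extracted href, roughly like this (links.txt is just an example filename):

// Sketch: write every href found above to links.txt, one per line
// (assumes $matches[1] holds the hrefs, as in the debugging comments)
foreach ($matches[1] as $href) {
    file_put_contents('links.txt', $href . PHP_EOL, FILE_APPEND);
}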