views:

194

answers:

1

I would like to create a crawler using php that would give me a list of all the pages on a specific domain (starting from the homepage: www.example.com).

How can I do this in php?

I don't know how to recursively find all the pages on a website starting from a specific page and excluding external links.

+2  A: 

For the general approach, check out the answers to these questions:

In PHP, you should be able to simply fetch a remote URL with file_get_contents(). You could perform a naive parse of the HTML by using a regular expression with preg_match_all() to find <a href=""> tags and parse the URLs out of them (see this question for some typical approaches).
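As a minimal sketch of that regex approach (the $html string here stands in for markup you would actually fetch with file_get_contents(); the pattern is deliberately naive and will miss unquoted attributes or choke on malformed HTML):

```php
<?php
// Sample markup standing in for a fetched page.
$html = '<p><a href="/about">About</a> <a href=\'http://other.com/\'>Out</a></p>';

// Naive pattern: grab the quoted href value from <a> tags.
preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $matches);

// $matches[1] now holds the raw href values.
print_r($matches[1]);
```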

Once you've extracted the raw href attribute, you can use parse_url() to break it into its components and figure out whether it's a URL you want to fetch - remember also that the URLs may be relative to the page you've fetched.
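For example, a rough same-domain check with parse_url() might look like this (same_domain() is a helper invented here for illustration; it treats any relative URL as internal, which matches the "exclude external links" goal but skips full relative-URL resolution):

```php
<?php
// Decide whether a raw href belongs to the domain being crawled.
function same_domain($href, $host) {
    $parts = parse_url($href);
    if ($parts === false) {
        return false; // seriously malformed URL
    }
    if (!isset($parts['host'])) {
        return true;  // relative URL: stays on the current site
    }
    return strcasecmp($parts['host'], $host) === 0;
}

var_dump(same_domain('/contact', 'www.example.com'));
var_dump(same_domain('http://www.example.com/a', 'www.example.com'));
var_dump(same_domain('http://other.com/', 'www.example.com'));
```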

Though fast, a regex isn't the best way of parsing HTML - you could also try the DOM classes to parse the HTML you fetch, for example:

$dom = new DOMDocument();
$dom->loadHTML($content);

$anchors = $dom->getElementsByTagName('a');

if ( $anchors->length > 0 ) {
    foreach ( $anchors as $anchor ) {
        if ( $anchor->hasAttribute('href') ) {
            $url = $anchor->getAttribute('href');

            //now figure out whether to process this
            //URL and add it to a list of URLs to be fetched
        }
    }
}
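Putting the pieces together, the whole recursive crawl can be driven by a queue instead of actual recursion. The sketch below is one possible shape, not a production crawler: the fetcher is passed in as a callable (so you can use 'file_get_contents' for real pages or a stub for testing), and relative-URL handling is deliberately naive (root-relative paths only):

```php
<?php
// Breadth-first crawl of one domain, tracking seen URLs to avoid loops.
function crawl($start, $host, $fetch, $limit = 100) {
    $queue = [$start];
    $seen  = [$start => true];
    $pages = [];

    while ($queue && count($pages) < $limit) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if ($html === false) {
            continue; // fetch failed; skip this page
        }
        $pages[] = $url;

        $dom = new DOMDocument();
        @$dom->loadHTML($html); // suppress warnings from sloppy markup
        foreach ($dom->getElementsByTagName('a') as $a) {
            if (!$a->hasAttribute('href')) {
                continue;
            }
            $href  = $a->getAttribute('href');
            $parts = parse_url($href);
            if (isset($parts['host']) && strcasecmp($parts['host'], $host) !== 0) {
                continue; // external link - excluded from the crawl
            }
            // Naive absolutization: assumes root-relative paths like "/foo".
            $abs = isset($parts['host']) ? $href : 'http://' . $host . $href;
            if (!isset($seen[$abs])) {
                $seen[$abs] = true;
                $queue[]    = $abs;
            }
        }
    }
    return $pages;
}
```

For real use you would call something like crawl('http://www.example.com/', 'www.example.com', 'file_get_contents'), and you would also want politeness delays and robots.txt handling, which this sketch omits.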

Finally, rather than write it yourself, see also this question for other resources you could use.

Paul Dixon