I want to create a PHP function that goes through a website's homepage, finds all the links on that page, follows each of the links it finds, and keeps going until no new links on the site are left to visit. I really need to build something like this so I can spider my network of sites and supply a "one stop" for searching.

Here's what I got so far -

function spider($urltospider, $current_array = array(), $ignore_array = array('')) {
    if(empty($current_array)) {
        // Make the request to the original URL
        $session = curl_init($urltospider);
        curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($session);
        curl_close($session);
        if($html != '') {
            $dom = new DOMDocument();
            @$dom->loadHTML($html);
            $xpath = new DOMXPath($dom);
            $hrefs = $xpath->evaluate("/html/body//a");
            for($i = 0; $i < $hrefs->length; $i++) {
                $href = $hrefs->item($i);
                $url = $href->getAttribute('href');
                if(!in_array($url, $ignore_array) && !in_array($url, $current_array)) {
                    // Add this URL to the current spider array
                    $current_array[] = $url;
                }
            }               
        } else {
            die('Failed connection to the URL');
        }
    } else {
        // There are already URLs in the current array
        foreach($current_array as $url) {
            // Connect to this URL

            // Find all the links in this URL

            // Go through each URL and get more links
        }
    }
}

The only problem is, I can't seem to get my head around how to proceed. Can anyone help me out? Basically, the function needs to keep repeating itself until everything has been found.

+3  A: 

I'm not a PHP expert, but you seem to be over-complicating it.

function spider($urltospider, $current_array = array(), $ignore_array = array('')) {
    if (empty($current_array)) {
        $current_array[] = $urltospider;
    }
    $cur_crawl = 0;
    // Don't use foreach, because it can get messed up if you change the array while inside the loop.
    while ($cur_crawl < count($current_array)) {
        // crawl() should return all links found in the given page (a rough sketch of it follows below)
        $links_found = crawl($current_array[$cur_crawl]);
        // Only keep links that aren't already in $current_array (or ignored), so we don't crawl them multiple times
        foreach ($links_found as $link) {
            if (!in_array($link, $current_array) && !in_array($link, $ignore_array)) {
                $current_array[] = $link;
            }
        }
        $cur_crawl += 1;
    }
    return $current_array;
}
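
One possible body for that crawl() helper, reusing the cURL/DOMXPath code from the question (an untested sketch; error handling and relative-URL resolution are left out):

function crawl($url) {
    // Fetch the page
    $session = curl_init($url);
    curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($session);
    curl_close($session);

    if ($html == '') {
        return array();
    }

    // Pull the href attribute out of every anchor tag
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $hrefs = $xpath->evaluate("/html/body//a");

    $links = array();
    for ($i = 0; $i < $hrefs->length; $i++) {
        $links[] = $hrefs->item($i)->getAttribute('href');
    }
    return $links;
}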
Umang
Thank you so much!
Raphael Caixeta
Glad I could help.
Umang
+2  A: 

The word you're looking for is recursion. In the foreach loop, you would simply call spider again, and it will enter the function for each URL and do the spidering recursively.

There's a rather significant problem, though: you have no base case, so unless you eventually reach pages with no links to other pages (dead ends), this function will never terminate. You need to bound it in a couple of ways; a rough sketch combining both follows the list.

  1. Use memoization to remember the results from URLs you've already seen, rather than requesting the same page over and over.

  2. Restrict the URLs you'll visit to a particular domain, e.g. those starting with 'http://www.somedomain.com', so you don't end up spidering the entire Internet.
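
A rough sketch combining both bounds (it assumes a hypothetical get_links($url) helper that returns the href values found on a page, and uses 'http://www.somedomain.com' purely as a placeholder):

function spider($url, &$visited, $domain = 'http://www.somedomain.com') {
    // Stop if we've already seen this page, or if it lies outside our domain
    if (isset($visited[$url]) || strpos($url, $domain) !== 0) {
        return array();
    }
    $visited[$url] = true; // remember the page so it is never requested again

    $found = array($url);
    foreach (get_links($url) as $link) {
        $found = array_merge($found, spider($link, $visited, $domain));
    }
    return $found;
}

// Usage:
$visited = array();
$allUrls = spider('http://www.somedomain.com/', $visited);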

Mike Mueller
+2  A: 

What you (probably) want to use is called "recursion".

Web pages are graphs. There are several algorithms for traversal of graphs; the simplest to understand is depth-first.

Say your site is laid out like this (repeated subtrees are cut off with "..."):

* http://example.com/
  * http://example.com/
    * ...
  * http://example.com/post/1/
    * http://example.com/
      * ...
    * http://example.com/about/
      * ...
    * http://example.com/archives/
      * ...
  * http://example.com/post/2/
    * http://example.com/
      * ...
    * http://example.com/about/
      * ...
    * http://example.com/archives/
      * ...
  * http://example.com/post/3/
    * http://example.com/
      * ...
    * http://example.com/about/
      * ...
    * http://example.com/archives/
      * ...
  * http://example.com/about/
    * http://example.com/
      * ...
    * http://example.com/archives/
  * http://example.com/archives/
    * http://example.com/
      * ...
    * http://example.com/about/
      * ...
    * http://example.com/post/1/
      * http://example.com/
        * ...
      * http://example.com/about/
        * ...
      * http://example.com/archives/
        * ...
    * http://example.com/post/2/
      * http://example.com/
        * ...
      * http://example.com/about/
        * ...
      * http://example.com/archives/
        * ...
    * http://example.com/post/3/
      * http://example.com/
        * ...
      * http://example.com/about/
        * ...
      * http://example.com/archives/
        * ...
    * http://example.com/post/4/
      * http://example.com/
        * ...
      * http://example.com/about/
        * ...
      * http://example.com/archives/
        * ...
    * http://example.com/post/5/
      * http://example.com/
        * ...
      * http://example.com/about/
        * ...
      * http://example.com/archives/
        * ...

When you first hit http://example.com/, you have the following links:

* http://example.com/
* http://example.com/post/1/
* http://example.com/post/2/
* http://example.com/post/3/
* http://example.com/about/
* http://example.com/archives/

You need to keep track of pages you have already visited, so you can ignore them. (Otherwise, it'd take forever to spider a page ... literally.) You add to the ignore list every time you visit a page. Right now, the only entry in the ignore list is http://example.com/.

Next, you filter out the ignored links, reducing the list to:

* http://example.com/post/1/
* http://example.com/post/2/
* http://example.com/post/3/
* http://example.com/about/
* http://example.com/archives/

You then run the fetcher again on each of these links. You do this by calling your function again with the current URL and the ignore list: spider($url, $ignoredUrls). The function takes $ignoredUrls by reference, so newly ignored items are visible to the parent spider calls.

Looking at http://example.com/post/1/, we see the following links:

* http://example.com/
* http://example.com/about/
* http://example.com/archives/

We have already looked at http://example.com/. The next link which isn't ignored is the about page. From the about page, we go to the archives page, where we look through each post. Each post has the same set of links:

* http://example.com/
* http://example.com/about/
* http://example.com/archives/

Because we already visited all of those links, we return an empty array.

Back at /archives/, we append the /post/2/ link (the first non-ignored link in /archives/) to a $foundLinks local variable, as well as the return value of the call to spider with /post/2/ (which is an empty array). Then we move on to the second post.

When we go through all our posts, we return $foundLinks. The /about/ page then adds those links to its own $foundLinks, in addition to the /about/ link. Flow goes back to /post/1/, which looks at /archives/ (which is now ignored). The /post/1/ spider is now complete, and returns its own $foundLinks. Eventually, the original call gets all the found links.


This method works fine for a small site which is completely closed. If you link to Wikipedia, though, you'll be spidering all day long. You can combat this problem in at least two ways:

  1. Terminating spidering after a certain depth (e.g. 10 links deep).
  2. Restricting the URLs, e.g. to a certain domain or subdomain (like example.com).

Here's a quick implementation of spider (untested):

function get_links($url) {
    // curl/DOM code here
}

define('SPIDER_MAX_DEPTH', 10);

function spider_internal($url, &$ignoredUrls, $depth = 0) {
    $foundUrls = array($url);

    // Remember this URL so we never request it again
    $ignoredUrls[] = $url;

    if($depth >= SPIDER_MAX_DEPTH) {
        return $foundUrls;
    }

    $links = get_links($url);

    foreach($links as $link) {
        if(in_array($link, $ignoredUrls)) {
            continue;
        }

        // Recurse through spider_internal so the depth counter and ignore list carry over
        $foundUrls = array_merge($foundUrls, spider_internal($link, $ignoredUrls, $depth + 1));
    }

    return $foundUrls;
}

function spider($url) {
    $ignoredUrls = array();

    return spider_internal($url, $ignoredUrls);
}
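
With get_links() filled in (the original leaves it as a stub), a call would look something like this; the starting URL is just the example.com placeholder from the walkthrough above:

$allUrls = spider('http://example.com/');
print_r($allUrls); // a flat array of every URL reached, at most SPIDER_MAX_DEPTH levels deep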
strager
A: 

You definitely do NOT want to use recursion when crawling the web. :)

Recursion works well for small sites, but it will consume all available RAM on large ones. For example, do you have enough RAM to crawl (and store a string reference for) every link on msn.com? Likely not.
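
For comparison, an iterative version (in the spirit of the first answer) keeps the frontier in an explicit queue rather than on the call stack, which also makes it easy to cap the number of pages or persist the queue somewhere cheaper than RAM. A minimal sketch, assuming a get_links($url) helper like the one stubbed out in the previous answer:

function spider_iterative($startUrl, $maxPages = 10000) {
    $queue = new SplQueue();
    $queue->enqueue($startUrl);
    $visited = array($startUrl => true); // keyed by URL for fast lookups

    $found = array();
    while (!$queue->isEmpty() && count($found) < $maxPages) {
        $url = $queue->dequeue();
        $found[] = $url;

        foreach (get_links($url) as $link) {
            if (!isset($visited[$link])) {
                $visited[$link] = true;
                $queue->enqueue($link);
            }
        }
    }
    return $found;
}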

arachnode.net