ansaurus

Question

Creating a PHP file that downloads all links from a certain site

Answer 1

A:

That's not a trivial problem. But if you have access to PHP's system() function, you can use wget to accomplish this task. It offers recursive downloading that follows links from page to page, and you can control how deep it follows them, among many other options. It also supports authentication and several protocols, including HTTP and FTP.
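A minimal sketch of that approach, assuming wget is installed and on the PATH. The helper name build_wget_command() and the "downloads" directory are made up for illustration; the wget options used (--recursive, --level, --no-parent, --directory-prefix) are standard ones. The sketch only builds and prints the command, and the line that runs it is left commented out.

```php
<?php
// Hypothetical helper: builds a wget command line for a recursive download.
function build_wget_command($url, $depth, $targetDir)
{
    return 'wget --recursive'
         . ' --level=' . (int) $depth                       // how many link levels to follow
         . ' --no-parent'                                   // never climb above the start URL
         . ' --directory-prefix=' . escapeshellarg($targetDir)
         . ' ' . escapeshellarg($url);                      // quote the URL for the shell
}

$command = build_wget_command('http://www.example.com/', 2, 'downloads');
echo $command, "\n";

// Uncomment to actually run it (requires wget and an enabled exec()):
// exec($command, $outputLines, $exitCode);
```

escapeshellarg() matters here: if the URL ever comes from user input, passing it to the shell unquoted would allow command injection.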

Wadih M. 2009-03-18 18:54:48

Answer 2

+3 A:

I'll point you in the right direction.

Use cURL for the downloading and a regular expression to extract the paths from the links.

Beware, though: a link on a site can be a relative link, so you need to check for that.

Ólafur Waage 2009-03-18 18:55:01

"a link on a site can be a relative link." The realpath() function should solve this I think. Yes?

Vordreller 2009-03-18 20:38:31

No, since the path is a remote HTTP path and realpath() shows you your local path.

Ólafur Waage 2009-03-18 20:44:42
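Since realpath() only knows about the local filesystem, relative links have to be resolved against the URL of the page they were found on. The following is a simplified sketch of that, not a full RFC 3986 implementation: resolve_url() is a hypothetical helper, it assumes the base URL is absolute with a scheme and host, and it does not handle fragment-only ("#top") or query-only ("?page=2") links.

```php
<?php
// Hypothetical helper: turns a link found on the page $base into an absolute URL.
function resolve_url($base, $relative)
{
    // Already absolute (has a scheme such as http: or mailto:): nothing to do.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $relative)) {
        return $relative;
    }

    $parts    = parse_url($base);
    $scheme   = $parts['scheme'];
    $host     = $parts['host'] . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = isset($parts['path']) ? $parts['path'] : '/';

    // Protocol-relative link such as //cdn.example.com/x.js
    if (substr($relative, 0, 2) === '//') {
        return $scheme . ':' . $relative;
    }

    if (substr($relative, 0, 1) === '/') {
        // Root-relative link: replaces the whole path of the base URL.
        $path = $relative;
    } else {
        // Path-relative link: replaces everything after the last slash.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relative;
    }

    // Collapse "." and ".." segments.
    $segments = array();
    foreach (explode('/', ltrim($path, '/')) as $segment) {
        if ($segment === '..') {
            array_pop($segments);
        } elseif ($segment !== '.') {
            $segments[] = $segment;
        }
    }

    return $scheme . '://' . $host . '/' . implode('/', $segments);
}

echo resolve_url('http://www.example.com/docs/index.html', '../img/logo.png'), "\n";
```

Each extracted link would be passed through a function like this before being handed to cURL or file_get_contents().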

Answer 3

A:

From the PHP fread() docs:

// For PHP 5 and up
$handle = fopen("http://www.example.com/", "rb");
$contents = stream_get_contents($handle);
fclose($handle);

or you could just use:

$aaa = file_get_contents('http://www.example.com/');

So:

1. Download the page that contains the list of links.
2. Parse that page for links (using a regex).
3. Download the content of each link and write it (fwrite) to disk.

Tip: Check the PHP documentation for each of those functions; there are quite a lot of nice examples.
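The three steps above can be sketched roughly as follows. The function names, the simple regex, and the file-naming scheme are all assumptions made for illustration. The sketch also assumes that the links on the page are already absolute and that allow_url_fopen is enabled. It uses file_put_contents(), which is equivalent to calling fopen(), fwrite() and fclose() in sequence.

```php
<?php
// Step 2: pull the href values out of the HTML with a deliberately simple regex.
function extract_hrefs($html)
{
    preg_match_all('/<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);
    return array_values(array_unique($matches[1]));
}

// Hypothetical naming scheme: use the last path segment of the URL as file name.
function local_filename($url)
{
    $name = basename((string) parse_url($url, PHP_URL_PATH));
    return $name !== '' ? $name : 'index.html';
}

// Steps 1 to 3 combined; returns the number of files written.
function download_links($pageUrl, $targetDir)
{
    $html = file_get_contents($pageUrl);                  // step 1: fetch the page
    if ($html === false) {
        return 0;
    }

    $saved = 0;
    foreach (extract_hrefs($html) as $link) {             // step 2: find the links
        $content = @file_get_contents($link);
        if ($content !== false) {                         // step 3: write to disk
            file_put_contents($targetDir . '/' . local_filename($link), $content);
            $saved++;
        }
    }
    return $saved;
}

// Usage (needs an existing, writable "downloads" directory):
// echo download_links('http://www.example.com/', 'downloads'), " files saved\n";
```

Note that two links ending in the same file name would overwrite each other with this naming scheme; a real script would need something more careful.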

Maiku Mori 2009-03-18 19:31:59

Answer 4

A:

This will do it (or help at least):

// include the scheme: without it the fopen() fallback would look for a local file
$pageRaw = fread_url('http://www.example.com/');

//link extraction regex        
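// group 1 captures the href value, group 2 the link text
// ($matches is passed without "&": call-time pass-by-reference is a fatal error since PHP 5.4)
preg_match_all ("/a[\s]+[^>]*?href[\s]?=[\s\"\']+".
                "(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
                $pageRaw, $matches);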

$matches = $matches[1];

foreach($matches as $link)
{    
    echo $link. '<br />';
}

//falls back to fopen if curl is not there
function fread_url($url,$ref="")
{
    $html = "";
    if(function_exists("curl_init")){
        $ch = curl_init();
        $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; ".
                      "Windows NT 5.0)";
        curl_setopt( $ch, CURLOPT_USERAGENT, $user_agent );
        curl_setopt( $ch, CURLOPT_HTTPGET, 1 );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );  // return the body instead of printing it
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, 1 );  // follow HTTP redirects
        curl_setopt( $ch, CURLOPT_URL, $url );
        curl_setopt( $ch, CURLOPT_REFERER, $ref );
        curl_setopt( $ch, CURLOPT_COOKIEJAR, 'cookie.txt' );
        $html = curl_exec($ch);                         // false on failure
        curl_close($ch);
    }
    else{
        // requires allow_url_fopen to be enabled
        $hfile = fopen($url,"r");
        if($hfile){
            while(!feof($hfile)){
                $html.=fgets($hfile,1024);
            }
            fclose($hfile);
        }
    }
    return $html;
}
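
As an alternative to the regex above, which can miss or mangle links in real-world markup, the links can be extracted with PHP's DOM extension instead. This is a different technique from the one in this answer, sketched here under the assumption that the DOM extension is available (it is enabled by default in most builds). The helper name extract_links_dom() is made up, and $html must not be empty.

```php
<?php
// Hypothetical helper: returns the href of every <a> element in $html.
function extract_links_dom($html)
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {                 // skip anchors without an href
            $links[] = $href;
        }
    }
    return $links;
}

$sample = '<p><a href="/docs/">Docs</a> <a name="top">no href</a> '
        . '<a href=page2.html>Next</a></p>';
print_r(extract_links_dom($sample));
```

The page fetched by fread_url() could be passed to this function in place of the preg_match_all() call; relative results such as "/docs/" still need to be resolved against the page URL.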

karim79 2009-03-18 21:50:50
