I recently installed the add-on "DownThemAll" in Firefox, and as I watched it download a huge number of pk3 files (map files for an open-source first-person shooter), I wondered if I could do the same with PHP.

Here's what I'm thinking:

foreach(glob("http://www.someaddress.ext/path/*.pk3") as $link) {
  //do something to download...
}

Yeah, that's about as far as I've gotten. I'm wondering whether to just initiate a download or to do it via a stream... I don't really know my way around this material; it's not what I usually do with PHP, but it has triggered my interest.

So is there anybody who knows how to tackle this problem?

A: 

That's not a trivial problem. But if you have access to the "system" command, you can use wget to accomplish this task. It can download recursively, following links across pages, and lets you control how deep it follows them, among other options. It also supports authentication and several protocols, including HTTP and FTP.
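
A minimal sketch of what that could look like from PHP, assuming wget is installed and system() is not disabled on the host. The helper name buildWgetCommand and the "maps" target directory are made up for illustration; the URL is the placeholder from the question.

```php
<?php
// Build a wget command line that fetches every .pk3 file linked from a page.
// buildWgetCommand() is a hypothetical helper, not part of PHP or wget.
function buildWgetCommand(string $pageUrl, string $targetDir): string
{
    return 'wget --recursive --level=1 --no-parent --no-directories'
         . ' --accept=pk3'                                   // keep only .pk3 files
         . ' --directory-prefix=' . escapeshellarg($targetDir) // where to save them
         . ' ' . escapeshellarg($pageUrl);                   // page to start from
}

$command = buildWgetCommand('http://www.someaddress.ext/path/', 'maps');
echo $command, "\n";

// Uncomment to actually run it (needs wget on the PATH and system() enabled):
// system($command, $exitCode);
```

Escaping both arguments with escapeshellarg() matters if the URL or directory ever comes from user input.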

Wadih M.
+3  A: 

I'll point you in the right direction.

Use cURL for the downloading and a regular expression to extract all the link paths from the page.

Beware though, a link on a site can be a relative link. So you need to check for that.
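
As a rough illustration of that check, here is a sketch that turns a link found on a page into an absolute URL. The function name resolveUrl is hypothetical, and this is a simplification rather than a full RFC 3986 resolver: it ignores query strings and fragments, and the URLs are placeholders.

```php
<?php
// Resolve $link (as found in an href) against the URL of the page it came from.
function resolveUrl(string $base, string $link): string
{
    // Already absolute: it carries its own scheme (http:, ftp:, ...).
    if (is_string(parse_url($link, PHP_URL_SCHEME))) {
        return $link;
    }
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';

    // Protocol-relative link such as //cdn.example.com/file.pk3
    if (strpos($link, '//') === 0) {
        return $scheme . ':' . $link;
    }
    $origin = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    if (strpos($link, '/') === 0) {
        $path = $link;                       // root-relative link
    } else {
        $basePath = $parts['path'] ?? '/';   // relative to the page's directory
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $link;
    }

    // Collapse "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;
        }
    }
    $trailing = ($out && substr($path, -1) === '/') ? '/' : '';
    return $origin . '/' . implode('/', $out) . $trailing;
}

echo resolveUrl('http://www.someaddress.ext/path/index.html', 'maps/a.pk3'), "\n";
```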

Ólafur Waage
"a link on a site can be a relative link." The realpath() function should solve this I think. Yes?
Vordreller
No, since the path is a remote HTTP path and realpath() shows you your local path.
Ólafur Waage
A: 

From the PHP fread() docs:

// For PHP 5 and up
$handle = fopen("http://www.example.com/", "rb");
$contents = stream_get_contents($handle);
fclose($handle);

or you could just use:

$aaa = file_get_contents('http://www.example.com/');

So:

  1. Download page which contains list of links
  2. Parse that list for links (using regex)
  3. Download and write (fwrite) content of each link to HDD.
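
The three steps above could be sketched roughly like this. The helper names extractLinks and saveToDisk are made up for illustration, the regex only handles quoted href attributes, and reading remote URLs with fopen() or file_get_contents() assumes allow_url_fopen is enabled.

```php
<?php
// Step 2: pull href values out of the HTML, keeping only the wanted extension.
function extractLinks(string $html, string $extension): array
{
    preg_match_all('/<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);
    $wanted = [];
    foreach ($matches[1] as $href) {
        $path = (string) parse_url($href, PHP_URL_PATH);
        if (strtolower(pathinfo($path, PATHINFO_EXTENSION)) === strtolower($extension)) {
            $wanted[] = $href;
        }
    }
    return array_values(array_unique($wanted)); // drop duplicate links
}

// Step 3: stream a source (URL or local path) to a file without
// holding the whole download in memory.
function saveToDisk(string $source, string $destination): bool
{
    $in = @fopen($source, 'rb');
    if ($in === false) {
        return false;
    }
    $out = @fopen($destination, 'wb');
    if ($out === false) {
        fclose($in);
        return false;
    }
    $bytes = stream_copy_to_stream($in, $out);
    fclose($in);
    fclose($out);
    return $bytes !== false;
}

// Step 1 plus the loop, left commented out because it needs network access.
// It assumes the page lists plain file names relative to $pageUrl.
// $pageUrl = 'http://www.someaddress.ext/path/';
// $html = file_get_contents($pageUrl);
// foreach (extractLinks($html, 'pk3') as $link) {
//     saveToDisk($pageUrl . $link, basename($link));
// }
```

Copying stream to stream is probably the better fit for large map files, since file_get_contents() would load each file fully into memory first.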

Tip: Check the PHP documentation for each of those functions; there are quite a lot of nice examples.

Maiku Mori
A: 

This will do it (or help at least):

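//include the scheme, otherwise the fopen fallback treats the URL as a local path
$pageRaw = fread_url('http://www.example.com/');

//link extraction regex
//$matches is passed without "&": call-time pass-by-reference is a fatal error since PHP 5.4
$matches = array();
preg_match_all ("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+".
                "(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
                $pageRaw, $matches);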

$matches = $matches[1];

foreach($matches as $link)
{    
    echo $link. '<br />';
}

//falls back to fopen if curl is not there
function fread_url($url,$ref="")
{
    //initialise so the fopen branch does not append to an undefined variable
    $html = "";
    if(function_exists("curl_init")){
        $ch = curl_init();
        $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; ".
                      "Windows NT 5.0)";
        curl_setopt( $ch, CURLOPT_USERAGENT, $user_agent );
        curl_setopt( $ch, CURLOPT_HTTPGET, 1 );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, 1 );
        curl_setopt( $ch, CURLOPT_URL, $url );
        curl_setopt( $ch, CURLOPT_REFERER, $ref );
        curl_setopt( $ch, CURLOPT_COOKIEJAR, 'cookie.txt' );
        $html = curl_exec($ch);
        curl_close($ch);
        //curl_exec returns false on failure; keep the return type a string
        if($html === false){
            $html = "";
        }
    }
    else{
        $hfile = fopen($url,"r");
        if($hfile){
            while(!feof($hfile)){
                $html.=fgets($hfile,1024);
            }
            fclose($hfile);
        }
    }
    return $html;
}
karim79