views: 134
answers: 8
How do I download an HTML file from a URL in PHP, and download all of the dependencies like CSS and Images and store these to my server as files? Am I asking for too much?

A: 

Screen scraping might be your best answer here.

Chris
+1  A: 

You might take a look at the curl wrappers for PHP: http://us.php.net/manual/en/book.curl.php

As far as dependencies, you could probably get a lot of those using some regular expressions that look for things like <script src="...">, but a proper (X)HTML parser would let you more meaningfully traverse the DOM.
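As a sketch of the parser approach, PHP's built-in DOMDocument can pull the dependency URLs out of already-fetched HTML (the `$html` string below is a made-up stand-in for a downloaded page):

```php
<?php
// Sketch: extracting script/img URLs with PHP's built-in DOMDocument
// rather than regexes. $html stands in for a page you already fetched.
$html = '<html><body><img src="/logo.png"><script src="app.js"></script></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($html); // @ silences warnings on sloppy real-world markup

$urls = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $urls[] = $img->getAttribute('src');
}
foreach ($doc->getElementsByTagName('script') as $script) {
    if ($script->getAttribute('src') !== '') { // inline scripts have no src
        $urls[] = $script->getAttribute('src');
    }
}

print_r($urls); // lists /logo.png and app.js
```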

theraccoonbear
I wasn't your downvoter, but using regexes to parse HTML is asking for a world of hurt.
Byron Whitlock
Never use RegEx to parse HTML. Why: http://www.codinghorror.com/blog/archives/001311.html
Henri Watson
I'm fully aware of why it's a bad idea (impossible, even) to "parse" HTML with regex. You should note that I specifically DID NOT use the word "parse", however: I meant you could use regex to "look for things", not "parse".
theraccoonbear
+4  A: 

I would recommend using an HTML parsing library to simplify everything, namely something like Simple HTML DOM.

Using Simple HTML DOM:

// requires simple_html_dom.php from the Simple HTML DOM project
include('simple_html_dom.php');

$html = file_get_html('http://www.google.com/');
foreach ($html->find('img') as $element) {
    // download the image and save it locally
    // (basename() is a naive filename choice, for illustration only)
    $data = file_get_contents($element->src);
    file_put_contents(basename($element->src), $data);
}

For downloading files (and HTML) I would recommend using an HTTP wrapper such as cURL, as it allows far more control than file_get_contents. However, if you want to use file_get_contents, there are some good examples on the PHP site of how to fetch URLs.

The more complex method lets you specify the headers, which is useful if you want to set the User-Agent. (If you scrape other sites a lot, it is good to send a custom user agent, as it lets website admins find your site or point of contact if you are using too much bandwidth, which is better than the admin simply blocking your IP address.)

$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n"
  )
);

$context = stream_context_create($opts);
$file = file_get_contents('http://www.example.com/', false, $context);

Although of course it can be done simply by:

$file = file_get_contents('http://www.example.com/');
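For instance, a custom User-Agent can be sent through the same stream-context mechanism (the agent string and contact URL below are placeholders):

```php
<?php
// Sketch: setting a custom User-Agent via a stream context.
// The agent string and contact URL are placeholders.
$opts = array(
  'http' => array(
    'method' => 'GET',
    'header' => "User-Agent: ExampleScraper/1.0 (+http://www.example.com/contact)\r\n"
  )
);
$context = stream_context_create($opts);

// The context can be inspected before use:
$set = stream_context_get_options($context);
echo $set['http']['header'];

// ...and then passed to file_get_contents as before:
// $file = file_get_contents('http://www.example.com/', false, $context);
```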
Yacoby
+1 This is how I would do it in pure PHP.
Byron Whitlock
+2  A: 

The library you want to look at is cURL with PHP. cURL performs actions pertaining to HTTP requests (and other networking protocols, but I'd bet HTTP is the most-used.) You can set HTTP cookies, along with GET/POST variables.

I'm not sure exactly if it will automatically download the dependencies - you might have to download the HTML, parse out the IMG/LINK tags, and then use cURL again to fetch those dependencies.
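One wrinkle when fetching the parsed-out dependencies is that their URLs are often relative. A minimal resolver might look like this (`resolve_url` is a hypothetical helper covering only the common cases, not the full RFC 3986 rules):

```php
<?php
// Sketch: a minimal relative-URL resolver (hypothetical helper; PHP has
// no built-in for this). Handles only the common cases.
function resolve_url($base, $rel) {
    if (parse_url($rel, PHP_URL_SCHEME) !== null) {
        return $rel;                       // already absolute
    }
    $p = parse_url($base);
    $root = $p['scheme'] . '://' . $p['host'];
    if ($rel !== '' && $rel[0] === '/') {
        return $root . $rel;               // root-relative
    }
    $dir = rtrim(dirname(isset($p['path']) ? $p['path'] : '/'), '/');
    return $root . $dir . '/' . $rel;      // document-relative
}

echo resolve_url('http://example.com/a/page.html', 'img/logo.png');
// http://example.com/a/img/logo.png
```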

There are a bazillion tutorials out there on how to do this. Here's a simple example of a basic HTTP GET request, from the people who make libcurl (upon which PHP's cURL bindings are based):

<?php
//
// A very simple example that gets an HTTP page.
//

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "http://www.zend.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);

curl_exec($ch);

curl_close($ch);
?>
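As written, that example prints the page straight to output; to capture the HTML into a string so it can be parsed for dependencies, the usual approach is CURLOPT_RETURNTRANSFER (a sketch; error handling omitted):

```php
<?php
// Sketch: fetching a page into a string with cURL so the HTML can be
// parsed afterwards, instead of being echoed directly.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // make curl_exec() return the body
$body = curl_exec($ch);                      // string on success, false on failure
curl_close($ch);
```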
rascher
A: 

What you would probably want to do is use SimpleXML to parse the HTML, and when you hit an <img> or <script> tag, read the src attribute and download that file.

Henri Watson
+7  A: 

The easiest way to do this would be to use wget. It can recursively download HTML and its dependencies. Otherwise you will be parsing the HTML yourself. See Yacoby's answer for details on doing it in pure PHP.
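If wget is available on the server, it can even be driven from PHP (a sketch; the flags shown are the standard GNU wget ones for fetching a page plus the files it needs):

```php
<?php
// Sketch: building a wget command to mirror a page and its CSS/images.
// --page-requisites grabs dependencies, --convert-links rewrites them
// to local paths, --directory-prefix picks the save directory.
$url = 'http://www.example.com/';
$cmd = 'wget --page-requisites --convert-links --no-parent'
     . ' --directory-prefix=./mirror ' . escapeshellarg($url);

echo $cmd, "\n";
// To actually run it from PHP (requires shell access on the host):
// shell_exec($cmd);
```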

Byron Whitlock
+1  A: 

Perl's Mechanize does this very well. There is a library that does a similar task to Mechanize, but for PHP, in the answer to this question:

http://stackoverflow.com/questions/199045/is-there-a-php-equivalent-of-perls-wwwmechanize

Tom J Nowell
+1  A: 

I think most of the options are covered in SO questions about PHP and screen scraping.

For example: how to implement a web scraper in php or how do i implement a screen scraper in php.

I realise you want more than just a screen scraper, but I think these questions will answer yours.

Matt Ellen