How do I download an HTML file from a URL in PHP, then download all of its dependencies (CSS, images, etc.) and store them on my server as files? Am I asking for too much?
You might take a look at the curl wrappers for PHP: http://us.php.net/manual/en/book.curl.php
As far as dependencies go, you could probably get a lot of those using regular expressions that look for things like <script src="...">, but a proper (X)HTML parser would let you traverse the DOM more meaningfully.
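For instance, a rough sketch using PHP's built-in DOMDocument (the URL is just a placeholder, and relative URLs would still need resolving against the page's base URL):
$doc = new DOMDocument();
// @ suppresses warnings from real-world, non-well-formed HTML
@$doc->loadHTMLFile('http://www.example.com/');

$deps = array();
foreach ($doc->getElementsByTagName('script') as $node) {
    if ($node->getAttribute('src')) {
        $deps[] = $node->getAttribute('src');
    }
}
// the same approach works for <img src="..."> and <link href="...">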
I would recommend using an HTML parsing library to simplify everything, namely something like Simple HTML DOM.
Using Simple HTML DOM:
$html = file_get_html('http://www.google.com/');
foreach ($html->find('img') as $element) {
    // download the image and save it locally
    // (resolve relative URLs against the page URL first)
    $src = $element->src;
    file_put_contents(basename($src), file_get_contents($src));
}
For downloading files (and HTML), I would recommend using an HTTP wrapper such as cURL, as it allows far more control than file_get_contents. However, if you wanted to use file_get_contents, there are some good examples on the PHP site of how to fetch URLs.
The more complex method allows you to specify the headers, which could be useful if you wanted to set the User-Agent. (If you are scraping other sites a lot, it is good to have a custom user agent, as it lets a website admin know your site or point of contact if you are using too much bandwidth, which is better than the admin simply blocking your IP address.)
$opts = array(
    'http' => array(
        'method' => "GET",
        // example headers; the User-Agent string here is just a placeholder
        'header' => "Accept-language: en\r\n" .
                    "User-Agent: MyCrawler/1.0 (contact@example.com)\r\n"
    )
);
$context = stream_context_create($opts);
$file = file_get_contents('http://www.example.com/', false, $context);
Although of course it can be done simply by:
$file = file_get_contents('http://www.example.com/');
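If you do go the cURL route for the extra control mentioned above, a minimal sketch might look like this (the User-Agent string and URL are placeholders):
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'MyCrawler/1.0 (contact@example.com)');
$file = curl_exec($ch);
curl_close($ch);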
The library you want to look at is cURL with PHP. cURL performs actions pertaining to HTTP requests (and other networking protocols, but I'd bet HTTP is the most used). You can set HTTP cookies, along with GET/POST variables.
I'm not sure whether it will automatically download the dependencies - you might have to download the HTML, parse out the IMG/LINK tags, and then use cURL again to fetch those dependencies (there's a rough sketch of that second step after the basic example below).
There are a bazillion tutorials out there on how to do this. Here's a simple example of a basic HTTP GET request from the people who make libcurl (upon which PHP's cURL bindings are based):
<?php
//
// A very simple example that gets an HTTP page.
// (Without CURLOPT_RETURNTRANSFER, curl_exec() prints the page straight to output.)
//
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.zend.com/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
?>
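And here is a rough, untested sketch of that second step - parse out the IMG/LINK URLs, then fetch each one with cURL again (the save directory and URL are placeholders, and relative URLs would need resolving against the page URL first):
// fetch the page body into a string rather than printing it
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
curl_close($ch);

// parse out the IMG src and LINK href values
$doc = new DOMDocument();
@$doc->loadHTML($page);
$deps = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $deps[] = $img->getAttribute('src');
}
foreach ($doc->getElementsByTagName('link') as $link) {
    $deps[] = $link->getAttribute('href');
}

// fetch each dependency with cURL again and save it
foreach ($deps as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    file_put_contents('downloads/' . basename($url), curl_exec($ch));
    curl_close($ch);
}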
What you would probably want to do is use SimpleXML to parse the HTML, and when you hit an <img> or <script> tag, read the src attribute and download that file.
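SimpleXML on its own won't load tag-soup HTML, so one common route (a sketch only, with a placeholder URL) is to let DOMDocument clean the markup up first and then import it:
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents('http://www.example.com/'));
$xml = simplexml_import_dom($doc);

foreach ($xml->xpath('//img[@src] | //script[@src]') as $tag) {
    $src = (string) $tag['src'];
    // resolve relative URLs first, then save the file locally
    file_put_contents(basename($src), file_get_contents($src));
}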
The easiest way to do this would be to use wget. It can recursively download HTML and its dependencies. Otherwise you will be parsing the HTML yourself. See Yacoby's answer for details on doing it in pure PHP.
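For example (a sketch; it assumes wget is installed, your host allows shelling out, and the URL and directory are placeholders):
$url = 'http://www.example.com/';
$dir = '/path/to/storage';
// -p grabs page requisites (images, CSS, scripts), -k rewrites links to the
// local copies, and -P sets the directory to save into
shell_exec('wget -p -k -P ' . escapeshellarg($dir) . ' ' . escapeshellarg($url));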
Perl's Mechanize does this very well. There is a library that does a similar task to Mechanize, but for PHP, in the answer to this question:
http://stackoverflow.com/questions/199045/is-there-a-php-equivalent-of-perls-wwwmechanize
I think most of the options are covered in SO questions about PHP and screen scraping.
For example, "how to implement a web scraper in PHP" or "how do I implement a screen scraper in PHP".
I realise you want more than just a screen scraper, but I think these questions will answer yours.