tags:
views: 70
answers: 3

A friend asked me this, and I couldn't answer.

He asked: I am making this site where you can archive your site...

It works like this: you enter a site, like something.com, and then our site grabs the content of that website (images and everything) and uploads it to our site. Then people can view an exact copy of the site at oursite.com/something.com, even if the server hosting something.com is down.

How could he do this? (PHP?) And what would the requirements be?

A: 

Use wget, either the Linux version or the Windows version from the GnuWin32 package. Get it here.
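
For example, here is a minimal sketch of driving wget from Python (the mirror_site helper and the paths are illustrative only; it assumes wget is installed and on the PATH):

    import subprocess

    def mirror_site(url, dest_dir):
        # Mirror a site so pages, images, CSS and JS are saved locally
        # and links are rewritten to point at the local copies.
        return subprocess.call([
            "wget",
            "--mirror",            # recursive download with timestamping
            "--convert-links",     # rewrite links to point at the local files
            "--page-requisites",   # also fetch images, CSS, JS needed to render
            "--adjust-extension",  # save HTML/CSS with matching file extensions
            "--directory-prefix", dest_dir,
            url,
        ])                         # non-zero exit may just mean some files failed

    mirror_site("http://something.com", "archive/something.com")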

Femaref
I think using wget will only get you the HTML. If the site has other resources (pictures, files, ...), you'll still be referencing the same (possibly unavailable) resources. If the point is to provide a temporary "failover" site, he might need to download those resources too.
yossale
This is wrong; wget can create mirrors and will grab other resources as well. You have to set it up correctly, of course.
Femaref
A: 

It sounds like you need to create a web crawler. Web crawlers can be written in any language, although I would recommend C++ (with cURL), Java (with URLConnection), or Python (with urllib2). You could probably also hack something together quickly with the curl or wget commands and Bash, although that is probably not the best long-term solution. Also, don't forget to download, parse, and respect the "robots.txt" file, if one is present, whenever you crawl someone's website.
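
As a rough Python 2 sketch of the fetch step, using urllib2 plus the standard robotparser module (fetch_allowed and the user-agent string are made-up names for illustration, not an existing API):

    import urllib2
    import robotparser
    from urlparse import urlparse

    def fetch_allowed(url, user_agent="MyArchiver"):
        # Fetch a page only if the site's robots.txt permits it.
        parts = urlparse(url)
        rp = robotparser.RobotFileParser()
        rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
        try:
            rp.read()                  # download and parse robots.txt
        except IOError:
            pass                       # unreachable robots.txt: treat as allowed
        if not rp.can_fetch(user_agent, url):
            return None                # disallowed by robots.txt
        request = urllib2.Request(url, headers={"User-Agent": user_agent})
        return urllib2.urlopen(request).read()

    html = fetch_allowed("http://something.com/")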

Michael Aaron Safyan
A: 
  1. Fetch the HTML using curl.
  2. Change all the image, CSS, and JavaScript references to absolute URLs if they are relative. (This is a bit unethical.) You could also fetch all of these assets and host them on your own site (see the sketch after this list).
  3. Respect the "robots.txt" of every site you fetch from. Read here.
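
Here is a rough Python 2 sketch of steps 1 and 2, using urllib2 in place of curl (rewrite_urls is an illustrative helper; a real implementation should use an HTML parser rather than a regex):

    import re
    import urllib2
    from urlparse import urljoin

    def rewrite_urls(base_url, html):
        # Turn relative src/href values into absolute URLs against base_url.
        def absolutize(match):
            attr, quote, value = match.groups()
            return "%s=%s%s%s" % (attr, quote, urljoin(base_url, value), quote)
        # Naive regex rewrite; fine for a sketch, fragile on real-world HTML.
        return re.sub(r'(src|href)=(["\'])(.*?)\2', absolutize, html)

    page_url = "http://something.com/"
    archived_html = rewrite_urls(page_url, urllib2.urlopen(page_url).read())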
Zimbabao