views: 265

answers: 3

I once ran a home-made forum system for a small group of online deathmatch players. These forums have long since been shut down, and are currently offline. What I want to do is create static HTML files that contain all of the data for the entire system, in the interest of having an online archive that the former users could search.

I have control over all of the original data. This would include:

  • Images
  • Profiles
  • Forum Threads
  • Database
  • PHP Scripts

Basically, I want to take the database out of the equation so that I don't have to waste the resources to keep it alive (and also because this was a home-made forum solution, I'm sure it's not very optimized).

Is this a feasible goal, or should I just keep the forums the way they are, and not worry about the overhead?

If it is possible (and remotely feasible), can I get a few suggestions about how to proceed?

+4  A: 

wget can create an HTML mirror of a website. Look in the docs for usage of --mirror.
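A typical invocation might look like this (the URL is a placeholder; adjust it for the real site):

```shell
# Mirror the forum into browsable static HTML:
#   --mirror            recursive download with timestamping (implies -r -N -l inf)
#   --adjust-extension  save pages with an .html extension (short form: -E)
#   --convert-links     rewrite links so they work locally (short form: -k)
#   --page-requisites   also fetch images/CSS needed to render each page (-p)
wget --mirror --adjust-extension --convert-links --page-requisites \
     http://forums.example.com/
```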

ceejayoz
This appears to be the answer I was looking for. 'wget -Em url' has done the trick! Thank you very much.
Strozykowski
A: 

What ceejayoz said. Alternatively, you can add caching headers to the bootstrap of your application, with a cache lifetime of as many years as you like.

Put a call to the function below at the top of each page, passing the number of hours you want the page cached client-side. Be sure to call it after session_start() if you use sessions, because session_start() emits headers that prevent caching.

function client_side_cache($hours)
{
    // If session_start() was used, clear the anti-caching headers it sends.
    header('Cache-Control: ', true);
    header('Pragma: ', true);
    header('Expires: ', true);

    // Get the If-Modified-Since request header as a Unix timestamp.
    $headers = getallheaders();
    if (isset($headers['If-Modified-Since']))
    {
        $modifiedSince = explode(';', $headers['If-Modified-Since']);
        $modifiedSince = strtotime($modifiedSince[0]);
    }
    else
    {
        $modifiedSince = 0;
    }

    // Calculate the Last-Modified timestamp: round the current time
    // down to the start of the current $hours-long window.
    $current_time  = time();
    $last_modified = (int)($current_time / ($hours * 3600));
    $last_modified = $last_modified * $hours * 3600;

    // If the client's copy is still within the window, the cache hasn't expired.
    if ($last_modified <= $modifiedSince)
    {
        header('HTTP/1.1 304 Not Modified');
        exit();
    }
    else // emit a new Last-Modified (either the cache expired or the page wasn't cached)
    {
        header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $last_modified) . ' GMT');
    }
}
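A hypothetical call site, assuming sessions are in use (the one-year window is just an example value):

```php
<?php
session_start();             // emits anti-caching headers...
client_side_cache(24 * 365); // ...which the function above then clears

// ...render the page as usual below...
```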
Itay Moav
Before posting on Stack Overflow, remove the unnecessary extra whitespace from your source code.
X-Istence
2000 is just around the corner ;-)
Itay Moav
+1  A: 

Use output buffering to capture all your output and write it to a file instead of sending it to the browser.

Edit your code so that at the top (before any HTML output to the browser), you have this line:

ob_start();

at the end of the script, add this:

$output = ob_get_clean();
file_put_contents("<<name of this page>>.html", $output);

You'd have to come up with some naming scheme so you don't get duplicates.
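Putting those pieces together, a page script might look like this sketch (the thread id and file name are assumptions, standing in for whatever naming scheme you choose):

```php
<?php
ob_start();  // start capturing everything the page prints

// ...existing forum code renders the thread as usual...
echo '<html><body>Thread contents here</body></html>';

$output = ob_get_clean();
file_put_contents('thread-42.html', $output);  // hypothetical naming scheme
echo $output;  // still serve the page to the browser this time
```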

Then use Apache RewriteRule directives (with a regex) to redirect all requests to the new HTML pages, so your existing links don't break.
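As a sketch, assuming threads were served by a viewthread.php?id=N script and archived as thread-N.html, an .htaccess rule might look like:

```apache
RewriteEngine On
# Map /viewthread.php?id=42 to /thread-42.html
RewriteCond %{QUERY_STRING} ^id=([0-9]+)$
RewriteRule ^viewthread\.php$ /thread-%1.html? [L,R=301]
```

The trailing ? on the substitution drops the original query string from the redirect target.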

To get all the pages, you could click through each one by one if there aren't many, write all the URLs into an array manually and loop through them, or even crawl the site yourself, collecting every URL on each page and adding new ones to the queue as you go.
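The crawl option could be sketched like this (assumptions: the forum runs at $base, links are absolute, and md5 of the URL stands in for a real naming scheme):

```php
<?php
// Naive breadth-first crawl: visit each page once, save a static copy,
// and queue any same-site links found in it.
$base  = 'http://localhost/forum/';
$queue = [$base . 'index.php'];
$seen  = [];

while ($queue) {
    $url = array_shift($queue);
    if (isset($seen[$url])) {
        continue;
    }
    $seen[$url] = true;

    $html = file_get_contents($url);
    file_put_contents(md5($url) . '.html', $html);  // stand-in naming scheme

    // Queue every same-site link we haven't visited yet.
    if (preg_match_all('/href="([^"]+)"/', $html, $matches)) {
        foreach ($matches[1] as $link) {
            if (strpos($link, $base) === 0 && !isset($seen[$link])) {
                $queue[] = $link;
            }
        }
    }
}
```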

nickf