you can request the http header to check if a web page has been edited by looking at its date but how about dynamic pages such as - php, aspx- which grabs its data from a database?
You can if it uses the http response headers correctly but it's often overlooked.
Otherwise storing a local md5-hash of the content might be useful to you (unless there's an easier in-content string you could hook out). It's not ideal (because it's quite a slow process) but it's an option.
Yes, you can and should use HTTP headers to mark pages as unexpired. If they are dynamic though (PHP, ASPX, etc.) and/or database driven, you'll need to manually control setting the Expires header/sending HTTP Not Modified appropriately. ASP.NET has some SqlDependency objects for this, but they still need to be configured and managed. (Not sure if PHP has something just like it, but there's probably something in PEAR if not...)
This is the exact purpose of the ETag header, but it has to be supported by your web framework or you need to take care that your application responds properly to requests with headers If-Match, If-Not-Match and If-Range (see HTTP Ch 3.11).
The Last-Modified
header will only be of use to you if the programmer of the site has explicitly set it to be returned.
For a regular, static page Last-Modified
is the timestamp of the last modification of the HTML file. For a dynamically generated page the server can't reliably assign a Last-Modified
value as it has no real way of knowing how the content has changed depending on request, so many servers don't generate the header at all.
If you have control over the page, then ensuring the Last Modified header is being set will ensure a check on Last-Modified
is successful. Otherwise you may have to fetch the page and either perform a regex to find a changed section (e.g. date/time in the header of a news site). If no such obvious marker exists, then I'd second Oli's suggestion of an MD5 on the page content as a way to be sure it has changed.
Even though you might think it's outdated I've always found Simon Willison's article on Conditional GET to be more than useful. The example is in PHP but it is so simple that you can adapt it to other languages. Here it is the example:
function doConditionalGet($timestamp) {
// A PHP implementation of conditional get, see
// http://fishbowl.pastiche.org/archives/001132.html
$last_modified = substr(date('r', $timestamp), 0, -5).'GMT';
$etag = '"'.md5($last_modified).'"';
// Send the headers
header("Last-Modified: $last_modified");
header("ETag: $etag");
// See if the client has provided the required headers
$if_modified_since = isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) ?
stripslashes($_SERVER['HTTP_IF_MODIFIED_SINCE']) :
false;
$if_none_match = isset($_SERVER['HTTP_IF_NONE_MATCH']) ?
stripslashes($_SERVER['HTTP_IF_NONE_MATCH']) :
false;
if (!$if_modified_since && !$if_none_match) {
return;
}
// At least one of the headers is there - check them
if ($if_none_match && $if_none_match != $etag) {
return; // etag is there but doesn't match
}
if ($if_modified_since && $if_modified_since != $last_modified) {
return; // if-modified-since is there but doesn't match
}
// Nothing has changed since their last request - serve a 304 and exit
header('HTTP/1.0 304 Not Modified');
exit;
}
With this you can use HTTP verbs GET or HEAD (I think it's also possible with the others, but I can't see the reason to use them). All you need to do is adding either If-Modified-Since
or If-None-Match
with the respective values of headers Last-Modified
or ETag
sent by a previous version of the page. As of HTTP version 1.1 it's recommended ETag
over Last-Modified
, but both will do the work.
This is a very simple example of how a conditional GET works. First we need to retrieve the page the usual way:
GET /some-page.html HTTP/1.1 Host: example.org
First response with conditional headers and contents:
200 OK ETag: YourETagHere
Now the conditional get request:
GET /some-page.html HTTP/1.1 Host: example.org If-None-Match: YourETagHere
And the response indicating you can use the cached version of the page, as only the headers are going to be delivered:
304 Not Modified ETag: YourETagHere
With this the server notified you there was no modification to the page.
I can also recommend you another article about conditional GET: HTTP conditional GET for RSS hackers.
I don't know a whole lot about meta tags and was asked to research how to bring back a last modified date for an html page that has been dynamically generated by a content management system. In layman's terms, can you explain exactly what I would put in my header??