Is there a standard way to tell when a page was last modified? Currently I am doing this:

URL url = new URL("http://www.cbc.ca");
URLConnection uCon = url.openConnection();
uCon.setConnectTimeout(5000);   // 5 seconds
String lastMod = uCon.getHeaderField("Last-Modified");   // null if the header is absent
System.out.println("last mod: " + lastMod);

However, it looks like some sites do not send a Last-Modified header.

http://www.cbc.ca has these header fields:

X-Origin-Server
Connection
Expires
null
Date
Server
Content-Type
Transfer-Encoding
Cache-Control

I could parse the page to try to extract its date, but this seems like a major pain. What is the standard?

(If possible I would like to stick with using URLConnection because that is what I use to download the webpage)

+5  A: 

There is no standard. Dynamically generated web pages generally do not send a Last-Modified header, and different web pages include dates in different ways. Some sites do not include a modification date at all and show only "© <current year>" at the bottom. You could try looking for a date near the top or the bottom of the page, but reliably extracting it would have to be site-specific.
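If you stay with URLConnection, the missing-header case can at least be made explicit. The following is a minimal sketch, not a definitive implementation: the class name LastModifiedCheck and the helper fromMillis are hypothetical names chosen for illustration, and the 5-second read timeout is an assumption added alongside the connect timeout from the question. It relies on URLConnection.getLastModified(), which returns 0 when the value is not known.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.time.Instant;
import java.util.Optional;

public class LastModifiedCheck {

    // getLastModified() reports "not known" as 0, so map that
    // sentinel to an empty Optional instead of the 1970 epoch.
    public static Optional<Instant> fromMillis(long lastModifiedMillis) {
        return lastModifiedMillis == 0L
                ? Optional.empty()
                : Optional.of(Instant.ofEpochMilli(lastModifiedMillis));
    }

    // Opens the connection and returns the Last-Modified time if the server sent one.
    public static Optional<Instant> lastModified(URL url) throws IOException {
        URLConnection uCon = url.openConnection();
        uCon.setConnectTimeout(5000); // 5 seconds, as in the question
        uCon.setReadTimeout(5000);    // assumption: also bound the read
        return fromMillis(uCon.getLastModified());
    }
}
```

A caller can then branch on lastModified(url).isPresent() and fall back to a site-specific strategy only when the header is absent.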

idealmachine
huh, that is what I thought too. Thanks!
sixtyfootersdude
+1  A: 

From HTTP/1.1: Header Field Definitions:

14.29 Last-Modified

The Last-Modified entity-header field indicates the date and time at which the origin server believes the variant was last modified.

   Last-Modified  = "Last-Modified" ":" HTTP-date

An example of its use is

   Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT

The exact meaning of this header field depends on the implementation of the origin server and the nature of the original resource. For files, it may be just the file system last-modified time. For entities with dynamically included parts, it may be the most recent of the set of last-modify times for its component parts. For database gateways, it may be the last-update time stamp of the record. For virtual objects, it may be the last time the internal state changed.

An origin server MUST NOT send a Last-Modified date which is later than the server's time of message origination. In such cases, where the resource's last modification would indicate some time in the future, the server MUST replace that date with the message origination date.

An origin server SHOULD obtain the Last-Modified value of the entity as close as possible to the time that it generates the Date value of its response. This allows a recipient to make an accurate assessment of the entity's modification time, especially if the entity changes near the time that the response is generated.

HTTP/1.1 servers SHOULD send Last-Modified whenever feasible.

It follows that Last-Modified is optional, and that its value depends on the nature of the original resource.
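When the header is present, its value is an HTTP-date like the one in the example above, so it can be parsed into a timestamp. Here is a rough sketch, assuming Java 8 or later; the class name HttpDates and the method parse are hypothetical. It handles only the RFC 1123 form shown in the quoted example; HTTP-date also permits two obsolete formats that this sketch treats as unparsable.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

public class HttpDates {

    // Parses a Last-Modified (or Date) header value in RFC 1123 form.
    // Returns empty when the header is missing or not in that form.
    public static Optional<Instant> parse(String headerValue) {
        if (headerValue == null) {
            return Optional.empty(); // header was not sent
        }
        try {
            ZonedDateTime parsed = ZonedDateTime.parse(
                    headerValue.trim(), DateTimeFormatter.RFC_1123_DATE_TIME);
            return Optional.of(parsed.toInstant());
        } catch (DateTimeParseException e) {
            return Optional.empty(); // unsupported or malformed date
        }
    }
}
```

With the code from the question, this could be called as HttpDates.parse(uCon.getHeaderField("Last-Modified")).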

Michael Konietzka
Thanks, very helpful!
sixtyfootersdude