Is HttpWebResponse.LastModified accurate? Is it always present? I am building a kind of focused web crawler, and I am stuck deciding whether to use a hash of the resource or just the HttpWebResponse.LastModified property to check the resource's "freshness".

Using the hash value means streaming the resource every time it's checked. This has a big impact on overall performance.

If I just check HttpWebResponse.LastModified, is it accurate?

+2  A: 

HttpWebResponse.LastModified returns the value of the HTTP Last-Modified response header.

HTTP response headers are set by the HTTP server sending the response. It's completely up to the server whether it sets the Last-Modified response header at all, and whether it sets it to an accurate value.

The Last-Modified response header is part of the Validation Model for Caching in HTTP. It is usually used in conjunction with the If-Modified-Since request header. You might want to read HTTP/1.1, part 6: Caching for the details.
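To make the validation model concrete, here is a minimal sketch in Python (standard library only, no network access) of reading a Last-Modified value and turning it into an If-Modified-Since request header for the next check. The function names and the plain-dict header representation are assumptions made for illustration; a real client would use its HTTP library's case-insensitive header collection.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict, Optional


def parse_last_modified(headers: Dict[str, str]) -> Optional[datetime]:
    """Return the Last-Modified time, or None if the header is absent or invalid."""
    raw = headers.get("Last-Modified")  # assumes exact-case keys (illustration only)
    if not raw:
        return None  # the server is free not to send the header at all
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None  # malformed date: treat it as if no validator was sent
    if parsed.tzinfo is None:
        # No usable zone information: assume UTC, as HTTP dates are GMT.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def conditional_headers(last_modified: Optional[datetime]) -> Dict[str, str]:
    """Build the request headers for revalidating a previously fetched resource."""
    if last_modified is None:
        return {}  # nothing to validate against: an unconditional GET is needed
    as_utc = last_modified.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}
```

If the server honours the conditional request and nothing has changed, it answers 304 Not Modified with no body, so the crawler avoids downloading and hashing the resource again.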

dtb
So do you think I will have to get the hash value of the resource? Do you know of any other way?
Jojo
BTW, thanks @dtb. So, do you know of any other way?
Jojo
@Jojo: Please read [HTTP/1.1, part 6: Caching](http://tools.ietf.org/html/draft-ietf-httpbis-p6-cache-11). It's really easy to read. You are interested in the Validation Model and the Freshness Model parts.
dtb
There's also the ETag, which I'd prefer over Last-Modified when present.
Julian Reschke
A: 

It depends on your purpose.

Last-Modified means that the server is happy for you to keep using an entity as long as its last-modified value is the same (or, by implication, later; it would be strange for a server's last-modified value to ever go backwards, but that could happen in some multi-server situations).

E-tag is stronger (all the more if it's not a "weak" e-tag) in that it identifies the specific entity (e-tags for different language versions, different content-type versions, or different content-encoding versions will differ unless they are actually the same entity [which can happen, in a restricted set of circumstances]).

Both can be "loose", in that a server may consider a change insignificant: it is happy for you to keep using the previous entity because it regards it as the same. The exception is "strong" e-tags, which must indicate octet-for-octet identity so that they can be used with range requests.
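The weak/strong distinction can be sketched as two comparison functions, following the comparison rules in the HTTP specification: a weak e-tag carries a `W/` prefix, strong comparison requires that neither tag is weak, and weak comparison ignores the prefix. The function names are made up for this example.

```python
from typing import Tuple


def split_etag(etag: str) -> Tuple[bool, str]:
    """Split an ETag header value into (is_weak, opaque_tag)."""
    etag = etag.strip()
    if etag.startswith("W/"):
        return True, etag[2:]
    return False, etag


def strong_match(a: str, b: str) -> bool:
    """Strong comparison: both tags are strong and their opaque tags are identical."""
    weak_a, tag_a = split_etag(a)
    weak_b, tag_b = split_etag(b)
    return not weak_a and not weak_b and tag_a == tag_b


def weak_match(a: str, b: str) -> bool:
    """Weak comparison: the opaque tags are identical, whether or not either is weak."""
    return split_etag(a)[1] == split_etag(b)[1]
```

For a crawler that only wants to know whether to re-fetch, weak comparison is normally enough; strong comparison matters when byte-for-byte identity is required, as with range requests.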

Both can of course just be plain wrong. Bugs happen. That said, when they are wrong it's more often in the other direction, reporting a change when none has happened (a valid behaviour: one is allowed to be over-cautious about freshness, which never does damage and only makes things sub-optimal).

The question, then, is whether you need to know that the server considers no change to have been made (most usage) or whether there really has been a change (pretty much restricted to diagnostic tools).

Unless you've a clear reason not to, trust last-modified and e-tag (but trust e-tag more).
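As a rough sketch of that advice for a crawler (in Python, standard library only): prefer the e-tag, fall back to last-modified, and only download and hash the body when the server sent neither. The `stored` record layout, the `fetch_body` callback, and the function name are all hypothetical, and headers are assumed to be a plain dict with exact-case keys.

```python
import hashlib
from typing import Callable, Mapping


def is_unchanged(
    stored: Mapping[str, str],
    headers: Mapping[str, str],
    fetch_body: Callable[[], bytes],
) -> bool:
    """Decide whether a resource still matches what was recorded at the last crawl.

    `stored` holds the hypothetical keys "etag", "last_modified" and "sha256".
    `fetch_body` downloads the body and is called only as a last resort.
    """
    etag = headers.get("ETag")
    if etag and stored.get("etag"):
        return etag == stored["etag"]  # trust the e-tag most

    last_modified = headers.get("Last-Modified")
    if last_modified and stored.get("last_modified"):
        # Compared as an opaque string, the same way a validator would be.
        return last_modified == stored["last_modified"]

    # No usable validator: pay the cost of streaming and hashing the body.
    digest = hashlib.sha256(fetch_body()).hexdigest()
    return digest == stored.get("sha256")
```

Passing the download as a callback keeps the expensive hash path out of the common case, which is the performance concern raised in the question.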

Jon Hanna
Hi Jon, for example the resource is a PDF file. Is the actual last modified date of the PDF file the same as the response's last modified date?
Jojo
Well, the resource is not the file but "e.g. the current documentation for our service" or whatever the PDF is for. The entity is the file (what's actually sent), and there can be more than one entity per resource (different languages, different content types, different compression types). For each of those, the last modification time of the entity would almost always be the same as that of the file. There are relatively obscure cases where you might do things differently, but if the entity is based on a file, you'd almost always do it that way.
Jon Hanna