What's a good method of programmatically generating ETags for web pages, and is this practice recommended? Some sites recommend turning ETags off, others recommend producing them manually, and some recommend leaving the default settings active. What's the best approach here?
ETags make sense when you rely heavily on caching. They are a good indicator of the state of a resource (e.g. a URL).
For example, say you use an Ajax request to pull a user's latest comments and you want to know whether there are any new ones. Changing the ETag whenever new content arrives gives your application a cheaper way to check than downloading the comments again.
If the ETag is the same, you can keep your cache; otherwise you rebuild it.
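To make the "same ETag, keep your cache" idea concrete, here is a minimal sketch of the server-side check. The function name `respond` and its parameters are made up for illustration, and splitting the header on commas is a simplification rather than a full parser:

```python
def _strip_weak(tag):
    # If-None-Match uses weak comparison, so a leading "W/" is ignored.
    return tag[2:] if tag.startswith("W/") else tag


def respond(current_etag, if_none_match, body):
    """Hypothetical helper: return (status, body) for a GET request.

    current_etag  -- quoted ETag of the resource as it is right now
    if_none_match -- raw If-None-Match header value, or None if absent
    """
    if if_none_match is not None:
        # The header may list several validators, or be a bare "*".
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        known = {_strip_weak(tag) for tag in candidates}
        if "*" in candidates or _strip_weak(current_etag) in known:
            # The client's copy is still good: send no body.
            return 304, b""
    # No validator, or the resource changed: send the full response.
    return 200, body
```

A client that cached the response together with its ETag sends that value back in If-None-Match; a 304 means it can keep what it has.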
ETags also make a lot of sense with RESTful APIs.
As for generating it, looking at the spec, I think you can do almost anything you want. A timestamp, a hash, whatever makes sense to you/your application.
I recommend generating a hash of the content, e.g. md5($content).
Additionally, to reduce the risk of hash collisions, you might want to add e.g. the ID of the content element to it (if this is appropriate).
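A rough Python equivalent of the content-hash suggestion, with the optional ID prefix, might look like this. `content_etag` and `content_id` are names invented for this sketch; the surrounding double quotes are there because the ETag header value is a quoted string:

```python
import hashlib


def content_etag(content: bytes, content_id=None) -> str:
    """Hypothetical helper: build an ETag from an MD5 hash of the content."""
    digest = hashlib.md5(content).hexdigest()
    # Optionally prefix the element's ID so two different elements
    # never share a tag even if their hashes were to collide.
    value = digest if content_id is None else f"{content_id}-{digest}"
    return f'"{value}"'
```

MD5 is used here only as a change detector, not for security; any stable hash would do the same job.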
ETags help when you have some kind of caching mechanism in front of your website generator. Browsers use them too: they store the ETag with the cached response and send it back in an If-None-Match header, alongside the older Last-Modified / If-Modified-Since mechanism.
Anyway, because an ETag is so simple, it is no problem to provide an HTTP header with one. I have heard that many web servers simply take the location of the file and its timestamp and compute an MD5 hash over this data.
As an example, we built a simple but effective ETag into our software. Every "content unit" (e.g. HTML, JPEGs, GIFs...) in our software has a unique ID and a version number (e.g. a JPEG with ID "17" and version "2" has been changed once). So the ETag is simply the string "id-version", here: "17-2". With the next change it would be "17-3", so the cache recognizes the change, loads the new content (once) completely and stores it in its own cache.
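The "id-version" scheme is small enough to sketch in a few lines. The function names below are illustrative and not taken from the software described above:

```python
def version_etag(unit_id: int, version: int) -> str:
    """Hypothetical helper: ETag of the form "id-version", e.g. "17-2"."""
    return f'"{unit_id}-{version}"'


def is_stale(cached_etag: str, unit_id: int, version: int) -> bool:
    # The cached copy is stale as soon as the version number moves on.
    return cached_etag != version_etag(unit_id, version)
```

With this, `version_etag(17, 2)` gives `"17-2"` (including the quotes), and a cache holding that tag is stale once the unit reaches version 3.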
But you could probably use the URL and a timestamp (e.g. the timestamp of the file), too.
Mufasa,
Yahoo (and YSlow) actually encourage their use, but with the caveat that auto-generated ETags will differ from server to server.
I can't yet vote, so I'll just say I agree with the suggestion of a hash of the file path and timestamp (or the table name + primary key value + timestamp if the resource is backed by database content).
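Here is a small sketch of both variants, assuming the file's modification time is what "timestamp" means and that the database row carries some last-updated value. `file_etag`, `row_etag` and their parameters are hypothetical names:

```python
import hashlib
import os


def _quoted_md5(raw: str) -> str:
    return '"' + hashlib.md5(raw.encode("utf-8")).hexdigest() + '"'


def file_etag(path: str) -> str:
    """Hypothetical helper: hash of the file path plus its mtime."""
    mtime_ns = os.stat(path).st_mtime_ns
    return _quoted_md5(f"{path}:{mtime_ns}")


def row_etag(table: str, primary_key, updated_at) -> str:
    """Hypothetical helper: hash of table name + primary key + timestamp."""
    return _quoted_md5(f"{table}:{primary_key}:{updated_at}")
```

Note that this only gives identical tags on every server if the path and the modification time are themselves identical on every server.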
I just fired up YSlow and it complained about ETags, so I did a little research. The issue, as per the Yahoo blog (see the comments too), is that the default ETag implementations use the file inode number or NTFS revision number or something else equally server-specific as part of the hash. This, while being fast, basically prevents the same file served by two different servers from having the same ETag, and it screws up both browsers and downstream caches or load balancers.
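To see why a server-specific value breaks things, here is a rough illustration. It is not any server's actual algorithm; the function `stat_etag` and the sample numbers are made up for the sketch:

```python
def stat_etag(inode: int, size: int, mtime: int, include_inode: bool = True) -> str:
    """Hypothetical helper: join file attributes, in hex, into an ETag."""
    parts = [size, mtime]
    if include_inode:
        # The inode is assigned by the local filesystem, so the same
        # file copied to another server will usually get a different one.
        parts.insert(0, inode)
    return '"' + "-".join(format(p, "x") for p in parts) + '"'


# Same file (same size and mtime) deployed on two servers:
server_a = stat_etag(inode=1001, size=2048, mtime=1700000000)
server_b = stat_etag(inode=2002, size=2048, mtime=1700000000)
tags_differ = server_a != server_b  # True: caches see a "changed" file
```

Leaving the inode out (`include_inode=False`) makes both servers produce the same tag, as long as size and modification time agree.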
The previous suggestion to use an MD5 hash is a good one, although you have to keep the hashing itself from becoming a performance problem. The implementation of that suggestion remains up to the reader, although offhand it seems to me like the sort of thing your framework might be able to handle for you.
For myself, since I'm in a simple environment where the file timestamp is more than adequate, I just turned them off in Apache using FileETag None in my .htaccess file. This shuts up YSlow and should make things fall back to the Last-Modified date on the file.