I've been reading this book (highly recommended):

[image: book cover]

And I have a particular question about the ETag chapter. The author says that ETags might harm performance and that you must tune them finely or disable them completely.

I understand the risks, but is it that hard to get ETags right?

I've just made an application that sends an ETag whose value is the MD5 hash of the response body. This is a simple solution, easy to achieve in many languages.
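A minimal sketch of the idea, assuming Python and a handler that already holds the full response body in memory. The names `make_etag` and `respond` are hypothetical and only for illustration; a real framework would supply its own request and response objects.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Return the body's MD5 hex digest wrapped in double quotes (ETag syntax)."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET, honoring If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None:
        # The header may carry several comma-separated tags, possibly weak (W/"...").
        candidates = []
        for tag in if_none_match.split(","):
            tag = tag.strip()
            if tag.startswith("W/"):
                tag = tag[2:]
            candidates.append(tag)
        if "*" in candidates or etag in candidates:
            # The client's copy is current: send headers only, no body.
            return 304, headers, b""
    return 200, headers, body
```

Note that the body still has to be produced and hashed on every request; the saving here is bandwidth, not server work.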

My question is: is this wrong? If so, why? And why doesn't the author (who obviously outsmarts me by many orders of magnitude) propose such a simple solution?

Thanks

EDIT Please Read!

People are misunderstanding the question. I already know what ETags are; I'm just asking whether an MD5 hash of the response body is good enough and, if so, why it isn't encouraged in the book. That last question is hard to answer unless you are the author :) so I'm trying to find the weak points of using an MD5 hash as an ETag.

A: 

I think the perceived problem with ETags is probably that your browser has to issue a (simple and small) request and parse the response for every resource on your page to check whether the ETag value has changed server side.

I personally find these extra small roundtrips to the server acceptable for frequently changing images, CSS and JavaScript (the server does not need to resend the content if the browser's ETag is current), since the mechanism makes it quite easy to mark 'updated' content.

ChristopheD
The problem mentioned in the book is that you need to come up with a special and maybe __smart__ strategy (the author even encourages dropping ETag support if you cannot find a good strategy). That's what I find weird: is MD5 a good solution? If so, why not just say that?
Pablo Fernandez
A proper `max-age` or `Expires` would let the client know how long to wait without sending even that tiny "is there anything new?" request. So you can save the roundtrips too.
Nicolás
@Pablo Fernandez: MD5 is fine, but I personally would not hash the entire contents of the file. Hashing the 'last file modification date' should prove enough. About the `why not just say that?` bit: the answer is probably right in the book title (High performance web sites). ETags (and their roundtrips) do add some overhead, which could be an important factor on a heavily loaded webserver (but at the same time they add flexibility)...
ChristopheD
@Nicolás: true, but `max-age` or `Expires` can't guarantee that the client is always (!) receiving the most up-to-date content.
ChristopheD
Hashing the modification date would be useless. If you're going to do that, you might as well drop ETags and let the client use Last-Modified + If-Modified-Since. The whole point of ETags is that they have better than 1-second resolution, and can go "back" to an ETag sent previously.
Nicolás
@Nicolás: very true (point taken). The Last-Modified / If-Modified-Since combo would behave nearly identically to an ETag signifying a last-changed timestamp (and is probably a better fit for this job ;-).
ChristopheD
+1  A: 

Having not read the book, I can't speak on the author's precise concerns.

However, the generation of ETags should be such that an ETag is only generated once when a page has changed. Generating an MD5 hash of a web page costs processing power and time on the server; if you have many clients connecting, it could start to cause performance problems.

Thus, you need a good technique for generating ETags only when necessary and caching them on the server until the related page changes.

Dancrumb
I have to digitally sign every server response with a shared secret. So the ETag was a nice side effect :)
Pablo Fernandez
+3  A: 

ETag is similar to the Last-Modified header. It's a mechanism that lets the client determine whether a resource has changed.

Arguably, an ETag that JUST HAPPENS to be the Last Modified date (i.e. the same text) meets all the criteria necessary for an ETag. It simply needs to be a unique value representing the state of a resource. Not unique across the entire domain of resources, simply within the resource.

Now, technically, an ETag has "infinite" resolution compared to a Last-Modified header. Last-Modified only changes at a granularity of 1 second, whereas an ETag can be sub-second.
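To make the granularity point concrete, here is a small Python sketch. The timestamps and the two body versions are invented for the example: two edits land within the same second, so the HTTP date is identical while a content-based ETag still differs.

```python
import hashlib
from email.utils import formatdate

# Two made-up modification times that fall within the same second.
t1 = 1_700_000_000.10
t2 = 1_700_000_000.90


def last_modified(ts):
    """Format a Unix timestamp as an HTTP date (whole seconds only)."""
    return formatdate(ts, usegmt=True)


def content_etag(body):
    """Quoted MD5 hex digest of the body."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


v1, v2 = b"version 1", b"version 2"

# Same Last-Modified value for both edits, but distinct ETags.
same_date = last_modified(t1) == last_modified(t2)
different_etag = content_etag(v1) != content_etag(v2)
```

A client revalidating with If-Modified-Since could miss the second edit here, while If-None-Match would catch it.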

You can implement both ETag and Last-Modified, or simply one or the other (or neither, of course). If Last-Modified is not sufficient, then consider an ETag.

Mind, I would not set ETag for "every" resource. Basically, I wouldn't set it for anything that has no expectation of being cached (dynamic content notably). There's no point in that case, just wasted work.

Edit: I see your edit, so let me clarify.

MD5 is fine. The only downside is calculating MD5 all the time. Running MD5 on, say, a 200K PDF file is expensive. Running MD5 on a resource that has no expectation of being cached (i.e. dynamic content) is simply wasteful.

The trick is simply that whatever mechanism you use, it should be as cheap as Last-Modified typically is. Last-Modified is, again, typically, a property of the resource, and usually very cheap to access.
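As an illustration of a validator that is about as cheap as Last-Modified, here is a sketch that builds the ETag from file metadata alone. This is an assumption for file-backed resources, not something the book prescribes, and `metadata_etag` is a hypothetical name. Using the nanosecond modification time keeps sub-second resolution where the filesystem provides it.

```python
import os


def metadata_etag(path):
    """Build an ETag from mtime (nanoseconds) and size, without reading the file."""
    st = os.stat(path)
    return '"%x-%x"' % (st.st_mtime_ns, st.st_size)
```

The trade-off is that this tracks the file's metadata rather than its content, so touching a file without changing it yields a new ETag.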

ETags should be similarly cheap. If you are using MD5, and you can cache/store the association between the resource and the MD5 hash, then that's a fine solution. However, recalculating the MD5 each time the ETag is needed basically runs counter to the idea of using ETags to improve overall server performance.

Will Hartung
Thanks. In my particular case I already have the MD5 because I'm digitally signing the requests, but I see this might be a performance problem for other scenarios. Thanks!
Pablo Fernandez