views: 588
answers: 4

Summary

As I look around Stack Overflow and the net, I find a general lack of good documentation on best practices for caching a high-performance site that uses sessions. It would be helpful if we could share some ideas around basic building blocks, particularly caching. For the purpose of this discussion, I'm avoiding memcached and focusing on the caching of static content and fully generated pages.

So to set up the scenario, imagine a web server (say nginx), a reverse proxy (say Varnish), an app server (whatever), and a db server (say MySQL).

Anonymous

  1. Static items (gif/jpg etc.)
  2. Semi dynamic (js/css)
  3. Dynamic

Logged In

  1. Static
  2. Semi dynamic (js/css)
  3. Dynamic

Generally speaking, all of the Anonymous content should be cacheable, as should most of the Logged In content (ignore dynamic; no ESI for now).

Anon #1

  • Set far-off Expires
  • Set ETag if possible
  • Cache-Control: max-age=315360000
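Translated into server config, the Anon #1 rules might look like this in nginx (a sketch; the location pattern is illustrative, and `expires max` stands in for the far-off Expires plus a long max-age):

```nginx
# Far-future caching for static assets (gif/jpg etc.)
location ~* \.(gif|jpg|jpeg|png|ico)$ {
    expires    max;                    # far-off Expires + Cache-Control: max-age
    add_header Cache-Control "public";
    etag       on;                     # send an ETag where possible
}
```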

Anon #2 (have the reverse proxy cache the result if dynamically generated; otherwise Anon #1 rules apply)

  • Cache-Control: public, s-maxage=3000

Anon #3

  • Cache-Control: public, s-maxage=300

Logged In #1

  • Set far-off Expires
  • Set ETag if possible
  • Cache-Control: max-age=315360000

Logged In #2 (have the reverse proxy cache the result if dynamically generated; otherwise Logged In #1 rules apply)

  • Cache-Control: public, s-maxage=3000

Logged In #3

  • Cache-Control: s-maxage=0, must-revalidate
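The whole scheme above collapses into a small lookup table in the app server. A sketch in Python (the table keys and the helper function are mine, not from any framework):

```python
# Map (audience, content class) -> Cache-Control value,
# following the scheme laid out above.
CACHE_RULES = {
    ("anon", "static"):       "max-age=315360000",
    ("anon", "semi-dynamic"): "public, s-maxage=3000",
    ("anon", "dynamic"):      "public, s-maxage=300",
    ("auth", "static"):       "max-age=315360000",
    ("auth", "semi-dynamic"): "public, s-maxage=3000",
    ("auth", "dynamic"):      "s-maxage=0, must-revalidate",
}

def cache_control(logged_in, content_class):
    """Return the Cache-Control value for a response."""
    audience = "auth" if logged_in else "anon"
    return CACHE_RULES[(audience, content_class)]

print(cache_control(False, "dynamic"))  # public, s-maxage=300
print(cache_control(True, "dynamic"))   # s-maxage=0, must-revalidate
```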

What are your suggestions? I'll update the post as answers come in.

+2  A: 

I don't know everything about caching, but here are some suggestions:

Anon #1,2: (static, semi-dynamic items) You could set them to never expire; if you need to change them, change their URL. If-Modified-Since checks are cheap, but not free.

Anon #3: (dynamic items) Here's where ETags and/or Last-Modified come in very handy. Depending on what you're serving, you can generate a good Last-Modified header. If your database stores the modified date of all items you were planning to show, you could do something to the effect of SELECT MAX(last_updated) FROM items_to_show. Caveat: this takes into account the age of the data, not the age of your template, so if you've changed your Django template, you'd be at a loss as to how to communicate that in the header.
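As a sketch of that query-driven Last-Modified (SQLite in memory here; the table and column names are invented for the example):

```python
import sqlite3
from email.utils import formatdate

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items_to_show (id INTEGER, last_updated INTEGER)")
conn.executemany("INSERT INTO items_to_show VALUES (?, ?)",
                 [(1, 1230000000), (2, 1231000000)])

# Newest change across everything the page will show
(newest,) = conn.execute(
    "SELECT MAX(last_updated) FROM items_to_show").fetchone()

# Format the Unix timestamp as an HTTP date for the Last-Modified header
last_modified = formatdate(newest, usegmt=True)
print(last_modified)
```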

Or you could do something similar with an ETag. It could be a checksum of the contents that are generated. This will take the changing of the template into account.
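A content-checksum ETag is only a few lines, and because it hashes the final rendered output, a template change alters it too. A sketch (the two example bodies stand in for the same data rendered through two templates):

```python
import hashlib

def make_etag(rendered_body):
    """Strong ETag from a checksum of the fully generated page."""
    digest = hashlib.md5(rendered_body.encode("utf-8")).hexdigest()
    return '"%s"' % digest

body_v1 = "<html><body>items: a, b</body></html>"
body_v2 = "<html><body><ul><li>a</li><li>b</li></ul></body></html>"

# Same data, different template -> different ETag
print(make_etag(body_v1) != make_etag(body_v2))  # True
```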

Something to note with both of these approaches to caching dynamic content is that they really save more bandwidth than they do web server/database load, since you still have to generate the page (or at least its validator) to answer the conditional request. You can always make judicious use of the Expires header, though, to help in cases where the changes to a page are periodic and predictable.

My suggestions for the logged-in content would be similar, except I would also look at the Vary header (e.g. Vary: Cookie). It signals to caching proxies that responses depend on a request header, so different logged-in users will not be served the same cached content.

In general, I would use either ETag or Last-Modified, but not both.
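Whichever validator you pick, the server-side check is the same: compare the client's copy of the validator and short-circuit with a 304 when it matches. A minimal, framework-free sketch (function and header names chosen for the example, using the ETag flavor):

```python
def conditional_get(request_headers, body, etag):
    """Return (status, response_headers, body), honoring If-None-Match."""
    if request_headers.get("If-None-Match") == etag:
        # Client's cached copy is current: revalidate, send no body
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body

status, headers, _ = conditional_get({"If-None-Match": '"abc"'}, b"page", '"abc"')
print(status)  # 304
```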

+1  A: 

There are some relevant suggestions on the ySlow pages.

ETags might not be a good idea, apparently (the default ETags generated by some servers differ from machine to machine, which breaks validation behind a server farm).

Tom
I am well aware of the YSlow stuff. =) I was looking for input from people who have worked through the intricacies of Cache-Control for different types of pages/content.
Jauder Ho
A: 

I would suggest reading Scalable Internet Architectures. There are several chapters devoted to scaling up via caching, CDNs, etc. It should point you in the right direction to get going; it helped me scale up the site I support immensely.

--

MikeJ
Rather than suggestions to read books, I'm looking for more concrete input/experiences from people; more along the lines of "This is what I did...."
Jauder Ho
I recommended the book because it's more practical in covering scale/performance and the best practices to achieve that end. It would then be up to you to pick an approach and empirically measure what works well on your site.
MikeJ
+2  A: 

My best answer to this is that you have plenty of options for all of the static files, each of which can produce gains in its own way and is beneficial in a specific scenario, so weigh up the pros and cons according to your specific needs.

However, what most people neglect to think about is their dynamic content. Caching db results and the like is great, but it still involves starting up the parsing engine of PHP/ASP or whatever.

If you look at the WP Super Cache plugin for WordPress, you will note that it can actually write your HTML out as static files. Not only that, but it also makes a gzipped copy and uses rewrite rules to check for the existence of these files as an alternative to starting up the parser. This is obviously going to give you the best result, as it saves not only processing time but also bandwidth.

If you want to see the performance disparity, compare the ApacheBench results of <?php die('hello world'); with serving a static .html page.

Obviously you would need to be careful with this sort of caching, but it can be a very useful replacement for full-page caching inside an interpreter like PHP.
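The heart of that trick is easy to replicate outside WordPress: after rendering a page once, write it out as a plain file plus a gzipped twin, and let rewrite rules serve those directly. A sketch (the file layout and function name are invented for the example):

```python
import gzip
import os
import tempfile

def cache_page(cache_dir, path, html):
    """Write rendered HTML as static .html and .html.gz copies,
    so rewrite rules can serve them without touching the interpreter."""
    base = os.path.join(cache_dir, path.strip("/") or "index")
    with open(base + ".html", "w") as f:
        f.write(html)
    with gzip.open(base + ".html.gz", "wt") as f:
        f.write(html)
    return base + ".html"

cache_dir = tempfile.mkdtemp()
cached = cache_page(cache_dir, "/about", "<html>about us</html>")
print(os.path.exists(cached), os.path.exists(cached + ".gz"))  # True True
```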

Bittarman