I've been thinking about batch reads and writes in a RESTful environment, and I've come to realize I have broader questions about HTTP caching. (Below I use commas (",") to delimit multiple record IDs, but that detail isn't essential to the discussion.)

I started with this problem:

1. Single GET invalidated by batch update

GET /farms/123         # get info about Old MacDonald's Farm
PUT /farms/123,234,345 # update info on Old MacDonald's Farm and some others
GET /farms/123

How does a caching server in between the client and the Farms server know to invalidate its cache of /farms/123 when it sees the PUT?

Then I realized this was also a problem:

2. Batch GET invalidated by single (or batch) update

GET /farms/123,234,345 # get info about a few farms
PUT /farms/123         # update Old MacDonald's Farm
GET /farms/123,234,345

How does the cache know to invalidate the multiple-farm GET when it sees the PUT go by?

So I figured that the problem was really just with batch operations. Then I realized that any relationship could cause a similar problem. Let's say a farm has zero or one owners, and an owner can have zero or one farms.

3. Single GET invalidated by update to a related record

GET /farms/123   # get info about Old MacDonald's Farm
PUT /farmers/987 # Old MacDonald sells his farm and buys another one
GET /farms/123

How does the cache know to invalidate the single GET when it sees the PUT go by?

Even if you change the models to be more RESTful, using relationship models, you get the same problem:

GET    /farms/123           # get info about Old MacDonald's Farm
DELETE /farm_ownerships/456 # Old MacDonald sells his farm...
POST   /farm_ownerships     # and buys another one
GET    /farms/123

In both versions of #3, the first GET should return something like (in JSON):

{
  "farm": {
    "id": 123,
    "name": "Shady Acres",
    "size": "60 acres",
    "farmer_id": 987
  }
}

And the second GET should return something like:

{
  "farm": {
    "id": 123,
    "name": "Shady Acres",
    "size": "60 acres",
    "farmer_id": null
  }
}

But it can't! Not even if you use ETags appropriately. You can't expect the caching server to inspect the contents for ETags -- the contents could be encrypted. And you can't expect the server to notify the caches that records should be invalidated -- caches don't register themselves with servers.

So are there headers I'm missing? Things that indicate a cache should do a HEAD before any GETs for certain resources? I suppose I could live with double-requests for every resource if I can tell the caches which resources are likely to be updated frequently.
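
(The closest thing I've found so far is marking each response as always requiring revalidation, e.g. something like:

HTTP/1.1 200 OK
Cache-Control: no-cache

which, if I'm reading the spec right, lets a cache store the response but forces it to check back with the origin server before reusing it -- essentially the double-request pattern I describe above.)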

And what about the problem of one cache receiving the PUT and knowing to invalidate its cache and another not seeing it?

+2  A: 

The HTTP protocol supports a request header called If-Modified-Since, which basically lets a caching server ask the web server whether the item has changed since it was last fetched. HTTP also supports Cache-Control headers in server responses, which tell cache servers what to do with the content (never cache this, assume it expires in one day, and so on).
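
A rough sketch of what that looks like on the wire, reusing the farm URI from the question (the dates and max-age are just illustrative values):

# original response from the Farms server
HTTP/1.1 200 OK
Cache-Control: max-age=3600
Last-Modified: Tue, 10 Mar 2009 08:00:00 GMT

# later, the cache revalidates instead of re-fetching the whole body
GET /farms/123 HTTP/1.1
If-Modified-Since: Tue, 10 Mar 2009 08:00:00 GMT

HTTP/1.1 304 Not Modified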

Also, you mentioned encrypted responses. HTTP cache servers cannot cache SSL traffic, because doing so would require them to decrypt the pages as a "man in the middle." That would be technically challenging (decrypt the page, store it, and re-encrypt it for the client) and would also break the page's security, causing "invalid certificate" warnings on the client side. It is technically possible for a cache server to do it, but it causes more problems than it solves and is a bad idea. I doubt any cache servers actually do this type of thing.

SoapBox
+5  A: 

Cache servers are supposed to invalidate the entity referred to by the URI on receipt of a PUT (but as you've noticed, this doesn't cover all cases).
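
Roughly, reusing the URIs from the question (a sketch of what a spec-following cache does):

GET /farms/123         # cache stores the response under the key /farms/123
PUT /farms/123         # cache invalidates its entry for /farms/123
PUT /farms/123,234,345 # cache invalidates /farms/123,234,345 only -- /farms/123 stays cached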

Aside from this, you could use Cache-Control headers on your responses to limit or prevent caching, and handle the conditional request headers that ask whether the URI has been modified since it was last fetched.

This is still a really complicated issue and in fact is still being worked on (e.g. see http://www.ietf.org/internet-drafts/draft-ietf-httpbis-p6-cache-05.txt)

Caching within proxies doesn't really apply if the content is encrypted (at least with SSL), so that shouldn't be an issue (still may be an issue on the client though).

frankodwyer
The original question doesn't mention cache servers; I think it was about the browser's local cache.
Karl
No, my original question states, "How does a caching server in between the client and the Farms server know to invalidate its cache of /farms/123 when it sees the PUT?" I meant both cache servers and local caches.
James A. Rosen
Re: SSL: see my comment about encrypted content over unencrypted channels.
James A. Rosen
+1  A: 

Unfortunately HTTP caching is based on exact URIs, and you can't achieve sensible behaviour in your case without forcing clients to do cache revalidation.

If you had:

GET /farm/123
POST /farm_update/123

You could use the Content-Location header to indicate that the second request modified the resource returned by the first one. AFAIK you can't do that with multiple URIs, and I haven't checked whether this works at all in popular clients.
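
Something like this on the response to the update (a sketch only -- as I said, I'm not sure caches actually act on it):

POST /farm_update/123 HTTP/1.1

HTTP/1.1 200 OK
Content-Location: /farm/123   # tells caches the body describes /farm/123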

The practical solution is to make pages expire quickly and handle If-Modified-Since or ETag validation with a 304 Not Modified status.
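
Roughly (the ETag value and max-age are just examples):

# response that expires quickly
HTTP/1.1 200 OK
Cache-Control: max-age=60
ETag: "farm-123-v1"

# after 60 seconds the cache (or client) revalidates
GET /farm/123 HTTP/1.1
If-None-Match: "farm-123-v1"

HTTP/1.1 304 Not Modified   # unchanged, so the cached copy is reused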

porneL
A: 

In re: SoapBox's answer:

  1. I think If-Modified-Since is the two-stage GET I suggested at the end of my question. It seems like an OK solution where the content is large, i.e. where the savings from not re-sending the content outweigh the overhead of the extra requests. That isn't true in my Farms example, since each farm's information is short. (See the sketch after this list.)

  2. It is perfectly reasonable to build a system that sends encrypted content over an unencrypted (HTTP) channel. Imagine a service-oriented architecture where updates are infrequent and GETs are (a) frequent, (b) need to be extremely fast, and (c) must be encrypted. You would build a server that requires a From header (or, equivalently, an API key in the request parameters) and sends back an asymmetrically encrypted version of the content for the requester. Asymmetric encryption is slow, but if properly cached it beats the combined cost of the SSL handshake (asymmetric encryption) and symmetric content encryption. Adding a cache in front of this server would dramatically speed up GETs.

  3. A caching server could reasonably cache HTTPS GETs for a short period of time. My bank might put a Cache-Control max-age of about 5 minutes on my account home page and recent transactions. I'm not terribly likely to spend a long time on the site, so sessions won't be very long, and I'll probably end up hitting my account's main page several times while I'm looking for that check I recently sent off to SnorgTees.
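
To illustrate point 1, the revalidation traffic for a farm record would look roughly like this (the ETag value is hypothetical):

GET /farms/123 HTTP/1.1
If-None-Match: "abc123"    # validator from the earlier 200 response

HTTP/1.1 304 Not Modified  # saves re-sending ~100 bytes of JSON,
                           # but still costs a round trip to the origin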

James A. Rosen
If-Modified-Since doesn't increase the number of requests.
porneL
I'm pretty sure it does. If the cache could figure out what entries were current, it wouldn't have to send the If-Modified-Since request. You're right that it doesn't _double_ the number. It's dependent on the ratio of reads to writes.
James A. Rosen
If-Modified-Since doesn't double the requests -- the server just responds with either the resource (if it has changed) or a "Not Modified" response, in which case the client is supposed to use the version it already has.
Rowland Shaw
You're both right -- it doesn't double the number. But RFC 2616 §13.2.1 ¶1 says, "HTTP caching works best when caches can entirely avoid making requests to the origin server." That's what I'm aiming for.
James A. Rosen
As I delve in, I see more and more that HTTP caching was built with the idea of caches reaching back to verify via If-Modified-Since. This seems like a lot of overhead, but it does seem to answer all of my problems.
James A. Rosen
It's impossible for a caching server to cache HTTPS GETs, since the SSL channel is opaque to the server. In fact it doesn't even see them as normal HTTP; they are done with the CONNECT method, which essentially punches a socket connection through the proxy.
frankodwyer
(actually I should add that there are some commercial proxies that can do some ugly spoofing of a CA to get around the SSL certificate warnings, but this is a really horrible solution and requires the proxy to be treated as a trusted CA)
frankodwyer
@frankodwyer -- I guess I always thought proxies could see the headers on SSL traffic. I'll take hat in hand on #3. Good comments.
James A. Rosen
My personal opinion is that any banking web application should ***NOT*** cache anything. If it's money-related, it's critical, and if it's a bank, it should be able to afford the hardware to serve all the uncached requests.
Andrei Rinea
+1  A: 

You can't cache dynamic content (without drawbacks), because... it's dynamic.

Karsten