I've been thinking about batch reads and writes in a RESTful environment, and I've come to realize that I have broader questions about HTTP caching. (Below I use commas (",") to delimit multiple record IDs, but that detail isn't essential to the discussion.)
I started with this problem:
1. Single GET invalidated by batch update
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farms/123,234,345 # update info on Old MacDonald's Farm and some others
GET /farms/123
How does a caching server in between the client and the Farms server know to invalidate its cache of /farms/123 when it sees the PUT?
Then I realized this was also a problem:
2. Batch GET invalidated by single (or batch) update
GET /farms/123,234,345 # get info about a few farms
PUT /farms/123 # update Old MacDonald's Farm
GET /farms/123,234,345
How does the cache know to invalidate the multiple-farm GET when it sees the PUT go by?
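To make cases 1 and 2 concrete, here is a minimal sketch of a cache that, as far as I understand the usual behaviour, keys stored responses by URL and invalidates only the exact URL targeted by an unsafe request. Everything in it (`UrlKeyedCache`, the `origin` function, the renamed farm) is hypothetical and only there for illustration; a real cache does much more.

```python
class UrlKeyedCache:
    """Toy cache: stores GET responses by URL, invalidates by exact URL only."""

    def __init__(self):
        self.store = {}  # URL -> cached response body

    def handle(self, method, url, origin):
        """Pass a request through the cache to `origin`, a callable (method, url) -> body."""
        if method == "GET":
            if url in self.store:
                return self.store[url]  # served from cache, origin not consulted
            body = origin(method, url)
            self.store[url] = body
            return body
        # Unsafe method: forward it, then drop only the entry for the targeted URL.
        body = origin(method, url)
        self.store.pop(url, None)
        return body


# Hypothetical origin: farm 123 is renamed by any PUT whose ID list includes it.
farms = {"123": "Shady Acres"}


def origin(method, url):
    ids = url.rsplit("/", 1)[1].split(",")
    if method == "PUT" and "123" in ids:
        farms["123"] = "Sunny Acres"
    return ",".join(farms.get(i, "?") for i in ids)


cache = UrlKeyedCache()
first = cache.handle("GET", "/farms/123", origin)
cache.handle("PUT", "/farms/123,234,345", origin)    # invalidates only the batch URL
second = cache.handle("GET", "/farms/123", origin)   # still the pre-PUT body
```

The cache has no way to know that "/farms/123,234,345" overlaps with "/farms/123"; to it they are just two different strings, so the second GET is answered from the stale entry.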
So I figured that the problem was really just with batch operations. Then I realized that any relationship could cause a similar problem. Let's say a farm has zero or one owners, and an owner can have zero or one farms.
3. Single GET invalidated by update to a related record
GET /farms/123 # get info about Old MacDonald's Farm
PUT /farmers/987 # Old MacDonald sells his farm and buys another one
GET /farms/123
How does the cache know to invalidate the single GET when it sees the PUT go by?
Even if you change the models to be more RESTful, using relationship models, you get the same problem:
GET /farms/123 # get info about Old MacDonald's Farm
DELETE /farm_ownerships/456 # Old MacDonald sells his farm...
POST /farm_ownerships # and buys another one
GET /farms/123
In both versions of #3, the first GET should return something like (in JSON):
{
  "farm": {
    "id": 123,
    "name": "Shady Acres",
    "size": "60 acres",
    "farmer_id": 987
  }
}
And the second GET should return something like:
{
  "farm": {
    "id": 123,
    "name": "Shady Acres",
    "size": "60 acres",
    "farmer_id": null
  }
}
But it can't! Not even if you use ETags appropriately. You can't expect the caching server to inspect the contents for ETags -- the contents could be encrypted. And you can't expect the server to notify the caches that records should be invalidated -- caches don't register themselves with servers.
So are there headers I'm missing? Things that indicate a cache should do a HEAD before any GETs for certain resources? I suppose I could live with double-requests for every resource if I can tell the caches which resources are likely to be updated frequently.
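To pin down what "check before every GET" could look like: as far as I can tell, the closest standard mechanism is a conditional GET rather than a HEAD. A response sent with `Cache-Control: no-cache` may be stored but has to be revalidated with the origin before reuse, and the validator travels in headers (`ETag` on the response, `If-None-Match` on the request) rather than in the body. Below is a rough sketch of that exchange; `Origin` and `RevalidatingCache` are hypothetical names, and the version-number ETag is just an assumption for the example.

```python
class Origin:
    """Hypothetical origin server that supports ETags and If-None-Match."""

    def __init__(self):
        self.body = '{"farmer_id": 987}'
        self.version = 1
        self.full_responses = 0  # how many times a full body was sent

    def get(self, if_none_match=None):
        etag = f'"v{self.version}"'
        if if_none_match == etag:
            return 304, etag, None  # unchanged: no body sent
        self.full_responses += 1
        return 200, etag, self.body

    def update(self, body):
        self.body = body
        self.version += 1


class RevalidatingCache:
    """Keeps one stored response but checks with the origin before every reuse."""

    def __init__(self, origin):
        self.origin = origin
        self.etag = None
        self.body = None

    def get(self):
        status, etag, body = self.origin.get(if_none_match=self.etag)
        if status == 200:  # first fetch or changed: replace the stored copy
            self.etag, self.body = etag, body
        return self.body   # on 304 the stored copy is still valid


origin = Origin()
cache = RevalidatingCache(origin)
before = cache.get()
origin.update('{"farmer_id": null}')  # a change the cache never saw go by
after = cache.get()                   # revalidation notices the new ETag
again = cache.get()                   # 304: stored body reused, no body transfer
```

This costs a round trip to the origin on every GET, which is the double-request trade-off mentioned above, but the body is only transferred when the resource has actually changed.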
And what about the problem of one cache receiving the PUT and knowing to invalidate its cache and another not seeing it?