I'm currently developing the foundation of a social platform and am looking for ways to optimize performance. My setup is based on the CakePHP framework, but I believe my question is relevant to any technology stack, as it relates to data caching.

Let's take a typical post-author relation, which is represented by 2 tables in my db. When I query the database for a specific blog post, the built-in ORM functionality in CakePHP also fetches the author of the post, the comments on the post, etc. All of this is returned as one big-ass nested array, which I store in cache under a unique identifier for the blog post in question.

When updating the blog post, it's child's play to destroy the cache entry for that post and have it regenerated on the next request.

But what happens when it isn't the main entity (in this case the blog post) that gets updated, but rather some of the related data? For example, a comment could be deleted, or the author could update his avatar. Are there any approaches (patterns) I could use to track updates to related data and update my cache accordingly?

I'm curious to hear whether you've run into similar challenges, and how you managed to overcome them. Feel free to answer from an abstract perspective if you're using another stack on your end. Your views are much appreciated, many thanks!

+1  A: 

One approach for memcached is to use tags ( http://code.google.com/p/memcached-tag/ ). For example, take your "big-ass nested array" for a post: let's say it includes the author's information and the post itself, and the post is shown on the front page and in a box in the sidebar. It would then get the tags: frontpage, {author-id}, sidebar, {post-id}. Now if someone changes the author information, you flush every cache entry carrying the tag {author-id}. But that's only one solution, and only for cache backends that support tags (not APC, for example, afaik). Hope that gives you an idea.
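The memcached-tag patch linked above is not part of stock memcached, so a common way to emulate tags on any plain key-value cache is per-tag version counters baked into the storage key. Here is a minimal sketch of that idea (the class and key names are made up for illustration, and a plain dict stands in for a real memcached client):

```python
class TaggedCache:
    """Tag-based invalidation via version counters: each tag has a
    counter, and current counter values are baked into the storage key.
    Bumping a tag's counter orphans every entry stored under it."""

    def __init__(self):
        self.store = {}        # stands in for a real memcached client
        self.tag_versions = {}

    def _real_key(self, key, tags):
        # Bake each tag's current version into the storage key.
        parts = ["%s:%d" % (t, self.tag_versions.get(t, 0)) for t in sorted(tags)]
        return key + "|" + "|".join(parts)

    def set(self, key, value, tags):
        self.store[self._real_key(key, tags)] = value

    def get(self, key, tags):
        return self.store.get(self._real_key(key, tags))

    def invalidate_tag(self, tag):
        # Old entries stay in storage but become unreachable; a real
        # memcached would eventually evict them under memory pressure.
        self.tag_versions[tag] = self.tag_versions.get(tag, 0) + 1

cache = TaggedCache()
cache.set("post-42", {"title": "Hello"}, tags=["frontpage", "author-7"])
cache.invalidate_tag("author-7")  # the author changed his avatar
cache.get("post-42", tags=["frontpage", "author-7"])  # -> None, entry is gone
```

The trade-off is that invalidated entries are never deleted explicitly, only orphaned, which is exactly what makes the scheme cheap: one counter increment kills arbitrarily many entries.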

Hannes
Hannes, thx for pointing this out. I have looked into memcached, but didn't realize that this option existed.
Shahways
+2  A: 

It is rather simple: cache entries can be

  • added
  • destroyed

You should take care of destroying cache entries when related data change: in the application layer, in addition to updating the data, you destroy certain types of cached entries whenever certain tables are updated. You keep track of the dependencies by hard-coding them.
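The hard-coded dependency table could look something like the following sketch (table names, key patterns, and the `invalidate_for` helper are all hypothetical; in CakePHP you would typically call this from a model's afterSave/afterDelete callbacks):

```python
# Hard-coded map: when this table changes, these cache key patterns die.
DEPENDENTS = {
    "comments": ["post:{post_id}"],
    "authors":  ["post:{post_id}", "author:{author_id}"],
}

cache = {}  # stands in for your cache backend

def invalidate_for(table, **ids):
    """Destroy every cache entry that depends on the updated table."""
    for pattern in DEPENDENTS.get(table, []):
        try:
            cache.pop(pattern.format(**ids), None)
        except KeyError:
            pass  # pattern needs an id we weren't given for this write

# A comment on post 42 was deleted -> the cached post entry must go too.
cache["post:42"] = {"title": "Hello", "comments": ["first!"]}
invalidate_for("comments", post_id=42)
```

The map is dumb and must be kept in sync with your schema by hand, which is precisely the "hard-coding" cost mentioned above.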

If you'd like to be smart about it, you could have your cache objects state their dependencies, and also cache the last update times of your DB tables.

Then you could

  • fetch cached data, examine dependencies,
  • get update times for relevant DB tables and
  • in case the record is stale (the update time of a table that your big-ass cache entry depends on is later than the time of the cache entry), drop it and get fresh data from the database.

You could even integrate the above into your persistence layer.
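The three steps above can be sketched roughly as follows (a toy version: function names are invented, and in-process dicts stand in for the cache backend and the table of update times):

```python
import time

table_updated_at = {}   # maintained by the application's write path
cache = {}              # key -> (value, dependencies, cached_at)

def mark_table_updated(table):
    table_updated_at[table] = time.time()

def cache_put(key, value, depends_on):
    cache[key] = (value, depends_on, time.time())

def cache_get(key):
    """Return the cached value, or None if missing or stale."""
    entry = cache.get(key)
    if entry is None:
        return None
    value, depends_on, cached_at = entry
    # Stale if any table this entry depends on changed after caching.
    for table in depends_on:
        if table_updated_at.get(table, 0) > cached_at:
            del cache[key]
            return None  # caller falls back to the database
    return value
```

On a `None` result the persistence layer would re-query the database and call `cache_put` again, so staleness is resolved lazily on read rather than eagerly on write.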

EDIT:
Of course, the above is for when you want a consistent cache. Sometimes, and for some data, you can relax the consistency requirements, and there are scenarios where a simple TTL is good enough. For a trivial example: with a TTL of 1 second you should mostly stay out of trouble with users while still taking load off data processing, and with longer times you might still be fine. Say you are caching the list of country ISO codes: your application might be perfectly OK saying "let's cache this for 86400 sec".
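A TTL cache is the simplest scheme of all, since it needs no invalidation bookkeeping whatsoever; a minimal sketch (class name invented, entries expire lazily on read):

```python
import time

class TTLCache:
    """Relaxed consistency: entries silently expire after a fixed
    time-to-live, with no dependency tracking at all."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None or entry[1] < time.time():
            self.store.pop(key, None)  # expired (or absent): drop it
            return None
        return entry[0]

# ISO country codes barely ever change: a day-long TTL is plenty.
countries = TTLCache(ttl_seconds=86400)
countries.set("iso", ["DE", "FR", "US"])
```

The price is that within the TTL window readers may see data that is up to `ttl_seconds` old, which is exactly the consistency relaxation described above.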

Furthermore, you could also track the times of information presented to user, for example

  • let's say the user has seen data A from the cache, and we know this data was created/modified at time t1
  • the user makes changes to data A (turning it into data B) and commits the change
  • the application layer can then check whether data A is still as in the DB (i.e. whether the cached data upon which the user based his decisions and/or changes was indeed fresh)
  • if it was not fresh, there is a conflict and the user should confirm the changes

This has the cost of an extra read of data A from the DB, but it occurs only on writes. Also, a conflict can occur not only because of the cache, but also because of multiple users trying to change the same data (i.e. it is related to locking strategies).
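This check-before-commit flow is essentially optimistic locking; a hypothetical sketch, using an integer modification counter in place of real timestamps and a dict in place of the DB:

```python
db = {"post:42": {"body": "data A", "modified_at": 100}}

def commit_change(key, new_body, seen_modified_at):
    """Apply a user's edit only if the row hasn't changed since the
    (possibly cached) version the edit was based on."""
    current = db[key]
    if current["modified_at"] != seen_modified_at:
        return False  # conflict: re-show fresh data, ask user to confirm
    db[key] = {"body": new_body, "modified_at": seen_modified_at + 1}
    return True

# User edited "data A" that was cached with modified_at == 100: accepted.
commit_change("post:42", "data B", seen_modified_at=100)
# A second user still holding the old timestamp now gets a conflict.
commit_change("post:42", "data C", seen_modified_at=100)  # -> False
```

In a real database you would do the compare-and-update atomically, e.g. with a WHERE clause on the version column, rather than as a separate read followed by a write.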

Unreason
I was afraid that this was going to be the only way out, but thanks for confirming that, since you managed to make it sound more "do-able" :-) So the downside is a lot of custom code per view, but the upside is that this would still allow easy switching of the caching back-end. Thx also for the additional insight into the locking feature, this will be for v2 :-)
Shahways
For a more complete conceptual overview and some practical issues, I found this read interesting: http://highscalability.com/blog/2010/9/30/facebook-and-site-failures-caused-by-complex-weakly-interact.html
Unreason