Definitions: resource = a collection of database records; regeneration = processing those records and outputting the corresponding HTML.

Current flow:

  • Receive client request
  • Check for resource in cache
  • If not in cache or cache expired, regenerate
  • Return result

The problem is that the regeneration step can tie up a single server process for 10-15 seconds. If several users request the same resource at once, several processes may end up regenerating the exact same resource simultaneously, each taking 10-15 seconds.

Wouldn't it be preferable to have the frontend signal some background process, saying "Hey, regenerate this resource for me"?

But then what would it display to the user? "Rebuilding" is not acceptable. All resources would have to be in the cache ahead of time. This could be a problem, as the database would almost be duplicated on the filesystem (it is too big to fit in memory). Is there a way to avoid this? Not ideal, but it seems like the only way out.

But then there's one more problem. How to keep the same two processes from requesting the regeneration of a resource at the same time? The background process could be regenerating the resource when a frontend asks for the regeneration of the same resource.
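One common answer to that last question is a cache-level lock: Memcached's `add` command is atomic and succeeds only if the key does not already exist, so whichever process wins the `add` does the regeneration and everyone else serves the stale copy. Here is a minimal sketch (in Python rather than PHP for brevity; the function names are hypothetical, and a plain dict stands in for a real shared cache client):

```python
import time

# A plain dict standing in for a shared cache such as Memcached.
cache = {}
LOCK_TTL = 30  # seconds; should exceed the worst-case regeneration time

def cache_add(key, value):
    """Memcached-style add: succeeds only if the key is absent (atomic there)."""
    if key in cache:
        return False
    cache[key] = value
    return True

def regenerate(resource_id):
    """Stands in for the expensive 10-15 second rebuild."""
    return "<html>resource %s</html>" % resource_id

def request_regeneration(resource_id):
    """Only one caller per resource wins the lock and rebuilds."""
    lock_key = "lock:" + resource_id
    if cache_add(lock_key, time.time()):
        # We hold the lock: rebuild, store, release.
        html = regenerate(resource_id)
        cache["page:" + resource_id] = html
        del cache[lock_key]
        return html
    # Someone else is already rebuilding; serve whatever stale copy exists.
    return cache.get("page:" + resource_id)
```

The same pattern works in PHP against a real Memcached client, since `add` there has the same "fail if key exists" semantics.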

I'm using PHP and the Zend Framework just in case someone wants to offer a platform-specific solution. Not that it matters though - I think this problem applies to any language/framework.

Thanks!

+2  A: 

With Varnish you can proactively cache page content and use grace to display stale, cached content if a response doesn't come back in time.

Enable grace period (Varnish serves stale (but cacheable) objects while retrieving the object from the backend)

You may need to tweak the dials to determine the best settings for how long to serve the stale content and how long it takes something to be considered stale, but it should work for you. More on the Varnish performance wiki page.
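As a rough sketch, enabling grace in VCL looks like the following (the exact variable names depend on your Varnish version; older releases use `obj.grace` in `vcl_fetch`, newer ones use `beresp.grace` in `vcl_backend_response`):

```vcl
sub vcl_recv {
    # Allow serving objects up to 2 minutes past their TTL.
    set req.grace = 120s;
}

sub vcl_fetch {
    # Keep objects around for 2 minutes after expiry so they can be
    # served stale while the backend regenerates them.
    set obj.grace = 120s;
}
```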

Nick Gerakines
+1  A: 

I recommend caching at the web-server level rather than in the application.

Behdad
+1  A: 

I have done just this recently for a couple of different things; in each case the basics are the same: the information can be pre-generated before use.

A PHP job is run regularly (maybe from CRON) which generates information into Memcached, which is then used potentially hundreds of times till it's rebuilt again.

Although they are cached for well-defined periods (be it 60 minutes or 1 minute), they are regenerated more often than that. Therefore, unless something goes wrong, they never expire from Memcached, because a newer version is cached before the old one can expire. Of course, you could also arrange for them to never expire at all.

I've also done similar things via a queue - you can see previous questions I've answered regarding 'BeanstalkD'.
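The refresh-before-expiry pattern described above can be sketched as follows (Python rather than PHP, with hypothetical names; a dict with expiry timestamps stands in for a real Memcached client). The key point is that the cron interval is shorter than the TTL, so readers never see an expired entry in normal operation:

```python
import time

TTL_SECONDS = 3600        # cache lifetime: 60 minutes
REFRESH_SECONDS = 600     # cron interval: rebuild every 10 minutes

cache = {}  # key -> (value, expires_at)

def cache_set(key, value, ttl):
    cache[key] = (value, time.time() + ttl)

def cache_get(key):
    entry = cache.get(key)
    if entry is None or entry[1] < time.time():
        return None  # missing or expired
    return entry[0]

def build_resource(resource_id):
    """Stands in for the expensive regeneration step."""
    return "<html>resource %s</html>" % resource_id

def warm_cache(resource_ids):
    """Run from cron every REFRESH_SECONDS (< TTL_SECONDS)."""
    for rid in resource_ids:
        cache_set("page:" + rid, build_resource(rid), TTL_SECONDS)
```

Because `warm_cache` runs six times per TTL window here, each entry is replaced well before its expiry, which is exactly the "never expires unless something goes wrong" behaviour described above.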

Alister Bulman
A: 

Depending on the content, jQuery.load() might be an option. (I used it for a Twitter feed.)

Step 1
Show the cached version of the feed.

Step 2
Update the content on the page via jQuery.load() and cache the results.

This way the page loads fast and displays up-to-date content (after x seconds, of course).
But if you are rebuilding/loading a full page, this wouldn't give a nice user experience.

Bob Fanger
A: 

You describe a few problems, perhaps some general ideas would be helpful.

One problem is that your generated content is too large to store entirely, so you can only cache a subset of the total content. You will need:

  • a method for uniquely identifying each content object that can be generated
  • a method for determining whether a content object is already in the cache
  • a policy for marking cached data stale, to indicate that background regeneration should run
  • a policy for expiring and replacing data in the cache

Keeping the unique content identification simple should help with performance, while your policies for expiring objects and marking them stale should define the priority for background regeneration of content objects. These may be simple updates to your existing caching scheme; on the other hand, it may be more effective to use a software package made specifically for this need, as it is not an uncommon problem.
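The staleness/expiry policy mentioned above is often implemented with two TTLs per entry: a soft TTL after which the entry is stale (still servable, but eligible for background regeneration) and a hard TTL after which it must not be served. A minimal sketch (hypothetical names, Python for brevity):

```python
import time

FRESH, STALE, EXPIRED = "fresh", "stale", "expired"

class CacheEntry:
    """A cached object with soft (stale) and hard (expiry) deadlines."""
    def __init__(self, value, soft_ttl, hard_ttl):
        now = time.time()
        self.value = value
        self.stale_at = now + soft_ttl      # eligible for background rebuild
        self.expires_at = now + hard_ttl    # must not be served after this

def classify(entry, now=None):
    """Decide how a request for this entry should be handled."""
    now = time.time() if now is None else now
    if now >= entry.expires_at:
        return EXPIRED   # must be regenerated before serving
    if now >= entry.stale_at:
        return STALE     # serve it, but queue a background rebuild
    return FRESH         # serve as-is
```

Requests hitting a STALE entry get the cached copy immediately while a rebuild is queued, which is how the frontend avoids ever showing a "rebuilding" page for content that was cached at least once.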

Another problem is that you don't want to duplicate the work of regenerating content. If you have multiple parallel generation engines with differing capabilities, this may not be so bad a thing; it may be best to queue the task to each and remove it from all other queues when the first generator completes the job. Consider tracking an object's state while regeneration is in progress, so that multiple background regeneration tasks can be active without unintentionally duplicating work. Once again, this can be folded into your existing caching system or handled by a dedicated caching software package.
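The in-progress tracking described above can be as simple as a pending set checked before enqueueing, so that many frontends requesting the same resource only produce one queued task. A sketch (hypothetical names; a real system would use a shared store or a queue server such as BeanstalkD rather than in-process structures):

```python
from collections import deque

queue = deque()   # regeneration tasks, in FIFO order
pending = set()   # resource ids that are queued or currently being rebuilt

def enqueue_regeneration(resource_id):
    """Queue a rebuild unless one is already queued or running."""
    if resource_id in pending:
        return False  # duplicate request; nothing to do
    pending.add(resource_id)
    queue.append(resource_id)
    return True

def worker_step(regenerate):
    """One iteration of a background worker: take a task and run it."""
    if not queue:
        return None
    rid = queue.popleft()
    try:
        result = regenerate(rid)
    finally:
        pending.discard(rid)  # allow future rebuild requests for this id
    return result
```

With this in place, a burst of identical frontend requests costs one regeneration rather than one per request.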

A third problem concerns what to do when a client requests data that is not cached and needs to be regenerated. If the data must be fully regenerated, you will be stuck making the client wait for regeneration to complete. To help with long generation times, you could define a policy for predictively prefetching content objects into the cache, though this requires a method for identifying relationships between content objects. Whether you want to serve the client a "regenerating" page until the requested content is available really depends on your users' expectations. Consider multi-level caches with compressed data archives if content regeneration cannot be improved beyond 10-15 seconds.

Making good use of a mature web caching software package will likely address all of these issues. Nick Gerakines mentioned Varnish which appears to be well suited to your needs.