In my current project (Rails 2.3) we have a collection of 1.2 million keywords, each associated with a landing page, which is effectively a search results page for that keyword. Each of those pages is pretty complicated, so it can take a long time to generate (up to 2 seconds under moderate load, and even longer during traffic spikes, on current hardware). The problem is that 99.9% of visits to those pages are new visits (via search engines), so caching a page on its first visit doesn't help much: that visit is still slow, and the next one could be several weeks away.

I'd really like to make those pages faster, but I don't have too many ideas on how to do it. A couple of things that come to mind:

  • build a cache for all keywords beforehand (with a very long TTL, a month or so). However, building and maintaining this cache can be a real pain, and the search results on the page might be outdated, or even no longer accessible;

  • given the volatile nature of this data, don't try to cache anything at all, and just try to scale out to keep up with traffic.
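To put a rough number on the first option, here is a back-of-the-envelope sketch in Ruby using the figures above (1.2 million keywords, up to 2 seconds per page). The worker counts are hypothetical, and 2 seconds is the moderate-load upper bound, so this is a pessimistic estimate rather than a measurement.

```ruby
# Rough estimate of how long a full cache warm-up would take.
KEYWORDS = 1_200_000      # from the question
SECONDS_PER_PAGE = 2.0    # from the question (upper bound under moderate load)
SECONDS_PER_DAY = 86_400.0

# workers is a hypothetical number of parallel generators; this assumes
# they scale linearly, which a CPU-bound server may not actually allow.
def warmup_days(pages, seconds_per_page, workers)
  pages * seconds_per_page / workers / SECONDS_PER_DAY
end

[1, 4, 16].each do |workers|
  days = warmup_days(KEYWORDS, SECONDS_PER_PAGE, workers)
  puts format("%2d worker(s): %.1f days", workers, days)
end
```

With a single worker the full pass takes close to 28 days, about as long as the proposed one-month TTL, so the cache would need to be rebuilt almost continuously.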

I'd really appreciate any feedback on this problem.

A: 

Something isn't quite adding up from your description. When you say 99.9% being new visits, that is actually pretty unimportant. When you cache a page you're not just caching it for one visitor. But perhaps you're saying that for 99.9% of those pages, there is only 1 hit every few weeks. Or maybe you mean that 99.9% of visits are to a page that only gets hit rarely?

In any case, the first thing I would be interested in knowing is whether there is a sizable percentage of pages that could benefit from full page caching? What defines a page as benefitting from caching? Well, the ratio of hits to updates is the most important metric there. For instance, even a page that only gets hit once a day could benefit significantly from caching if it only needs to be updated once a year.
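As a rough illustration of the hits-to-updates ratio, the Ruby sketch below estimates the best-case hit rate of a page cache that is invalidated on every update. The `PageStats` struct, the `cache_hit_rate` helper, and all of the numbers are hypothetical; real figures would come from your access logs.

```ruby
# Hypothetical per-page stats; the numbers are illustrative, not measured.
PageStats = Struct.new(:name, :hits_per_year, :updates_per_year)

# Simplified model: the first hit and the first hit after each update are
# misses (the page must be regenerated); every other hit is served from cache.
def cache_hit_rate(stats)
  return 0.0 if stats.hits_per_year.zero?
  misses = [stats.updates_per_year + 1, stats.hits_per_year].min
  1.0 - misses.to_f / stats.hits_per_year
end

pages = [
  PageStats.new("daily hit, yearly update", 365, 1),
  PageStats.new("long-tail keyword", 17, 52) # ~1 hit per 3 weeks, weekly update
]

pages.each do |page|
  puts format("%-26s %5.1f%% hits", page.name, cache_hit_rate(page) * 100)
end
```

Under this model the once-a-day page is served from cache almost every time, while a page that changes more often than it is visited never gets a cache hit at all.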

In many cases page caching can't do much, so then you need to dig into more specifics. First, profile the pages... what are the slowest parts to generate? What parts have the most frequent updates? Are there any parts that are dependent on logged-in state of the user (doesn't sound like you have users though?)?

The lowest-hanging fruit (and the one whose benefits propagate throughout the system) is good old-fashioned optimization. Why does it take 2 seconds to generate a page? Optimize the hell out of your code and data store, but don't go doing things willy-nilly like removing all Rails helpers. Always profile first (NewRelic Silver and Gold are tremendously useful for getting traces from the actual production environment, and definitely worth the cost). Then optimize your data store, whether through denormalization or, in extreme cases, by switching to a different DB technology.

Once you've exhausted all reasonable direct optimization strategies, look at fragment caching. Can the most expensive part of the most commonly accessed pages be cached with a good hit-to-update ratio? Be wary of solutions that are complicated or require expensive maintenance.
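The idea behind fragment caching can be sketched in plain Ruby as below. This is not Rails's implementation (in a Rails view the `cache` helper plays this role); `FragmentCache`, the cache key, and the sidebar markup are all made up for illustration.

```ruby
# A tiny in-memory fragment cache with a TTL. Illustration only.
class FragmentCache
  Entry = Struct.new(:value, :expires_at)

  # clock is injectable so expiry can be exercised without sleeping.
  def initialize(clock = -> { Time.now })
    @store = {}
    @clock = clock
  end

  # Returns the cached fragment for key if it has not expired; otherwise
  # runs the block, stores its result for ttl seconds, and returns it.
  def fetch(key, ttl)
    now = @clock.call
    entry = @store[key]
    return entry.value if entry && entry.expires_at > now

    value = yield
    @store[key] = Entry.new(value, now + ttl)
    value
  end
end

renders = 0
expensive_sidebar = lambda do
  renders += 1 # stands in for the slow part of the page
  "<ul><li>related keyword</li></ul>"
end

cache = FragmentCache.new
html_a = cache.fetch("sidebar/ruby", 3600) { expensive_sidebar.call }
html_b = cache.fetch("sidebar/ruby", 3600) { expensive_sidebar.call }
puts "renders: #{renders}, same html: #{html_a == html_b}"
```

The second `fetch` returns the stored fragment without running the block, which is the whole payoff: the expensive part is rendered once per TTL window instead of once per request.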

If there is any cardinal rule for optimizing the cost of scalability, it is that you want enough RAM to fit everything you need to serve on a regular basis, because this will always get you more throughput than disk access, no matter how clever you try to be about it. How much needs to be in RAM? Well, I don't have a lot of experience at extreme scales, but if you have any disk IO contention then you definitely need more RAM. The last thing you want is IO contention for something that should be fast (i.e. logging) because you are waiting for a bunch of stuff that could be in RAM (page data).
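For a feel of the numbers, here is a rough Ruby estimate of the RAM needed to hold every rendered page. The 1.2 million figure comes from the question; the 50 KB average page size is purely an assumption, so substitute a measured value.

```ruby
# Back-of-the-envelope RAM estimate for keeping all rendered pages in memory.
PAGES = 1_200_000   # from the question
AVG_PAGE_KB = 50    # assumption: average rendered page size in KiB

# Converts a page count and average size in KiB to GiB (1 GiB = 1024^2 KiB).
def cache_size_gib(pages, avg_page_kb)
  pages * avg_page_kb / (1024.0 * 1024.0)
end

puts format("%.1f GiB for %d pages", cache_size_gib(PAGES, AVG_PAGE_KB), PAGES)
```

Under that assumption the full set comes to roughly 57 GiB, so whether it fits in RAM depends heavily on the real page size and on how much of the data is actually served on a regular basis.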

One final note. All scalability is really about caching (CPU registers > L1 cache > L2 cache > RAM > SSDs > disk drives > network storage). It's just a question of grain. Page caching is extremely coarse-grained, dead simple, and trivially scalable if you can do it. However, for huge data sets (Google) or highly personalized content (Facebook), caching must happen at a much finer-grained level. In Facebook's case, they have to optimize down to the individual asset. In essence, they need to make it so that any piece of data can be accessed in just a few milliseconds from anywhere in their data center. Every page is constructed individually for a single user with a customized list of assets, and all of this has to be put together in < 500ms.

dasil003
Thanks for the detailed answer. What I meant by 99.9% new visits is that a page is accessed once every few weeks on average, so it's usually a cache miss. The page cache would need a very long TTL to be effective, and by then the data on the page would no longer be relevant. And there is no common data that we could extract and fragment cache (at least nothing computationally expensive, only some static data).
Oleg Shaldybin
About profiling: I've done some profiling on these pages with NewRelic. There are no obvious bottlenecks in the DB, and slow transaction traces show a totally random pattern of time consumption across different calls in the request lifecycle. The CPU burn is also pretty high, so I think we're just low on resources on our server. Actually, playing with the size of the Passenger pool helped a little bit. I'll also try scaling out to another server; I hope some load balancing will help.
Oleg Shaldybin