Something isn't quite adding up in your description. When you say 99.9% of visits are new visits, that by itself isn't very meaningful: when you cache a page you're not just caching it for one visitor. But perhaps you're saying that for 99.9% of those pages, there is only 1 hit every few weeks. Or maybe you mean that 99.9% of visits are to a page that only gets hit rarely?
In any case, the first thing I'd want to know is whether there is a sizable percentage of pages that could benefit from full page caching. What defines a page as benefitting from caching? The ratio of hits to updates is the most important metric there. For instance, even a page that only gets hit once a day can benefit significantly from caching if it only needs to be updated once a year.
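To make that concrete, here's a minimal sketch of classic Rails page caching (the `caches_page` / `expire_page` API; on newer Rails it lives in the actionpack-page_caching gem). The controller and attribute names are hypothetical:

```ruby
# Hypothetical ArticlesController. caches_page writes the rendered HTML
# into public/, so the web server answers repeat hits without touching
# Rails at all.
class ArticlesController < ApplicationController
  caches_page :show

  def show
    @article = Article.find(params[:id])
  end

  def update
    @article = Article.find(params[:id])
    @article.update!(title: params[:title])
    # Expire the cached file only when the page actually changes --
    # this is where a good hit-to-update ratio pays off.
    expire_page action: :show, id: @article.id
    redirect_to @article
  end
end
```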
In many cases page caching can't do much, in which case you need to dig into more specifics. First, profile the pages: what are the slowest parts to generate? Which parts have the most frequent updates? Are any parts dependent on the logged-in state of the user (it doesn't sound like you have users, though)?
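If you don't have a profiler handy yet, even Rails' built-in instrumentation will tell you roughly where the time goes. A rough sketch, using real ActiveSupport::Notifications event names but an illustrative log format and file location:

```ruby
# config/initializers/profiling.rb (hypothetical location)
# Log every SQL query and every template render with its duration, so you
# can see which parts of a slow page dominate the generation time.
ActiveSupport::Notifications.subscribe("sql.active_record") do |_name, start, finish, _id, payload|
  Rails.logger.info(format("SQL  %6.1fms  %s", (finish - start) * 1000, payload[:sql]))
end

ActiveSupport::Notifications.subscribe("render_template.action_view") do |_name, start, finish, _id, payload|
  Rails.logger.info(format("VIEW %6.1fms  %s", (finish - start) * 1000, payload[:identifier]))
end
```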
The lowest-hanging fruit (and what will propagate throughout the system) is good old-fashioned optimization. Why does it take 2 seconds to generate a page? Optimize the hell out of your code and data store. But don't go doing things willy-nilly like removing all Rails helpers; always profile first (NewRelic Silver and Gold are tremendously useful for getting traces from the actual production environment, and definitely worth the cost). Then optimize your data store. This could be through denormalization or, in extreme cases, by switching to a different DB technology.
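As one example of the denormalization end of that spectrum, Rails' counter cache trades a little write-time bookkeeping for removing a COUNT query from every page view (model names are hypothetical):

```ruby
# Requires an integer comments_count column on articles, e.g.
#   add_column :articles, :comments_count, :integer, default: 0
class Comment < ActiveRecord::Base
  # Rails increments/decrements articles.comments_count automatically
  # whenever a comment is created or destroyed.
  belongs_to :article, counter_cache: true
end

class Article < ActiveRecord::Base
  has_many :comments
end

# In the view, article.comments_count is a plain column read --
# no COUNT(*) per article when rendering a list.
```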
Once you've exhausted the reasonable direct optimization strategies, look at fragment caching. Can the most expensive parts of the most commonly accessed pages be cached with a good hit-to-update ratio? Be wary of solutions that are complicated or expensive to maintain.
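In Rails terms that usually means the `cache` view helper wrapping just the expensive fragment, keyed so it rebuilds only when the underlying data changes. A sketch, with hypothetical template and variable names:

```erb
<%# app/views/articles/_popular.html.erb (hypothetical) %>
<%# The key includes the newest updated_at, so the fragment is rebuilt %>
<%# only when the underlying data changes -- a good hit-to-update ratio. %>
<% cache ["popular-articles", @popular_articles.maximum(:updated_at)] do %>
  <ul>
    <% @popular_articles.each do |article| %>
      <li><%= link_to article.title, article %></li>
    <% end %>
  </ul>
<% end %>
```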
If there is any cardinal rule to optimizing scalability cost, it is that you want enough RAM to fit everything you need to serve on a regular basis, because RAM will always get you more throughput than disk access no matter how clever you try to be about it. How much needs to be in RAM? Well, I don't have a lot of experience at extreme scales, but if you have any disk IO contention then you definitely need more RAM. The last thing you want is IO contention on something that should be fast (e.g. logging) because the disk is busy with reads for a bunch of stuff that could be in RAM (page data).
One final note. All scalability is really about caching (CPU registers > L1 cache > L2 cache > RAM > SSD Drives > Disk Drives > Network Storage). It's just a question of granularity. Page caching is extremely coarse-grained, dead simple, and trivially scalable if you can do it. However, for huge data sets (Google) or highly personalized content (Facebook), caching must happen at a much finer grain. In Facebook's case, they have to optimize down to the individual asset. In essence, they need to make it so that any piece of data can be accessed in just a few milliseconds from anywhere in their data center. Every page is constructed individually for a single user with a customized list of assets, and all of it has to be put together in < 500ms.