You basically can't do point 1 in real time -- the time interval is just too short. So you need to analyze all the pages you're going to be serving ads on beforehand, and store that information in a form that can be rapidly accessed at ad-serving time.
That doesn't necessarily imply "being a search engine company": presumably you're not going to serve ads on billions of different URLs, after all, but only on a far smaller number of URLs that belong to your company or its partners. That also means you can count on collaboration from the URLs' owners: e.g., you don't need a general spider, but can rely on the owners using the sitemaps protocol properly to let you know about new, updated or removed URLs; you can trust each page's keywords, title and headers to provide the important info; and so forth.
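For the offline analysis pass, something along these lines would do as a starting point -- just an illustrative sketch (standard library only, no error handling, all names made up), assuming partners publish a standard sitemaps.org sitemap and that the title, meta keywords and h1/h2 tags carry the signal you care about:

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(sitemap_url):
    """Yield the <loc> entries of a partner's sitemap."""
    with urllib.request.urlopen(sitemap_url) as resp:
        tree = ET.parse(resp)
    for loc in tree.iter(SITEMAP_NS + "loc"):
        yield loc.text.strip()

class KeywordExtractor(HTMLParser):
    """Collect title, meta keywords and h1/h2 text from one page."""
    def __init__(self):
        super().__init__()
        self.keywords = []
        self._capture = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "keywords":
            self.keywords += [k.strip().lower()
                              for k in attrs.get("content", "").split(",")
                              if k.strip()]
        elif tag in ("title", "h1", "h2"):
            self._capture = tag

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

    def handle_data(self, data):
        if self._capture:
            self.keywords += data.lower().split()

def analyze(url):
    """Return the raw keyword list for one URL (weighting comes later)."""
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = KeywordExtractor()
    parser.feed(html)
    return parser.keywords
```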
So with a relatively small number of servers (say a few dozen, maybe on EC2 or another "cloud" service) you can keep an in-memory distributed hash table mapping URLs to (for example) sets of related keywords, with weights for the keywords' relative importance, and a similar table for candidate ads. Indeed, if you don't have a "real-time auction" aspect to your system, you might even get away with precomputing a URL-to-ads correspondence (presumably you do want some dynamic adjustment, auction-wise or otherwise, but with a reasonable approximation that can be modeled as a simple incremental op on the precomputed correspondence).
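Here's a rough sketch of what those tables and the precomputed correspondence might look like on a single node; the keyword weights are assumed to come out of the offline analysis above, and the dot-product scoring rule is just one illustrative choice, not *the* way to do it:

```python
from collections import defaultdict
import heapq

# url -> {keyword: weight} and ad_id -> {keyword: weight}; in a real system
# these would be sharded across the in-memory distributed hash table.
url_keywords = {}
ad_keywords = {}

# inverted index keyword -> {ad_id}, so scoring a URL only touches relevant ads
ads_by_keyword = defaultdict(set)

def register_ad(ad_id, weights):
    ad_keywords[ad_id] = weights
    for kw in weights:
        ads_by_keyword[kw].add(ad_id)

def score(url, ad_id):
    """Weighted keyword overlap between a URL and an ad (one possible rule)."""
    uw, aw = url_keywords[url], ad_keywords[ad_id]
    return sum(w * aw[kw] for kw, w in uw.items() if kw in aw)

def precompute_top_ads(url, n=10):
    """Precompute the URL-to-ads correspondence for one URL."""
    candidates = set()
    for kw in url_keywords[url]:
        candidates |= ads_by_keyword[kw]
    return heapq.nlargest(n, candidates, key=lambda ad: score(url, ad))

# At ad-serving time the work is then essentially one hash lookup, plus
# whatever small incremental adjustment (auction or otherwise) you layer on top.
```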
If you do need to scale to serving ads on billions of URLs, then you do need a far more sophisticated approach than can be effectively summarized in an SO answer -- but then, if that's the scale of your ambition, you had better put together an engineering team that's not daunted by the task (and far more than a few dozen servers ;-).