This is actually a real problem I'm working on, but for simplicity, let's pretend I'm Google.
Say the user searches for "nanoscale tupperware". There aren't very many pages with both words... only about 3k. But there are ~2 million pages with "nanoscale" and ~4 million with "tupperware". Still, Google finds the 3k for me in 0.3 seconds.
How does it do it?
The only algorithm I'm aware of is to get the documents for "nanoscale", get the documents for "tupperware", and then do a list merge. But that's O(N + M), or about 6,000,000 operations here, which seems a little slow. Particularly if I'm running it on a desktop instead of an uber-fast cluster.
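For concreteness, here's roughly what I mean by a list merge (an illustrative Python sketch, assuming each term's postings are kept as a sorted list of integer document IDs, which may not match how a real search engine stores them):

```python
def merge_intersect(postings_a, postings_b):
    """Walk two sorted posting lists in lockstep, keeping the common doc IDs."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result
```

This touches every element of both lists in the worst case, which is where the O(N + M) comes from.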
So is that actually what Google is doing, and their speed is due mostly to the fact that they're running this expensive computation on their massive distributed cluster?
Or is there a better algorithm that I'm not aware of? Wikipedia and Google aren't turning up anything for me.
Edit:
Since people seem to be focusing on the Google aspect of my question, I guess I'll restate it in the actual terms.
I have several very large indexes (millions of items each) implemented as key/value pairs. Keys are single words; values are sets of documents. A common use case is running lookups against several different indexes and combining the results; the pain point is computing the intersection of those document sets.
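For reference, here's a rough sketch of the kind of structure I mean (plain Python dicts and sets with integer doc IDs standing in for documents; purely illustrative, not my actual implementation):

```python
from collections import defaultdict

# One map per index: word -> set of document IDs.
index_a = defaultdict(set)
index_b = defaultdict(set)

def lookup(index, word):
    """Return the set of documents containing `word` (empty set if absent)."""
    return index.get(word, set())

def intersect_results(*result_sets):
    """The pain point: intersect the document sets from several lookups."""
    smallest, *rest = sorted(result_sets, key=len)
    return smallest.intersection(*rest)
```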
I can re-implement my indexes however I want - it's mostly an academic project at this point.