This is actually a real problem I'm working on, but for simplicity, let's pretend I'm Google.

Say the user searches for "nanoscale tupperware". There aren't very many pages with both words... only about 3k. But there are ~2 million pages with "nanoscale" and ~4 million with "tupperware". Still, Google finds the 3k for me in 0.3 seconds.

How does it do it?

The only algorithm I'm aware of is to get the documents for "nanoscale", get the documents for "tupperware", and then do a list merge. But that's O(N + M), roughly 6 million operations here, which seems a little slow, particularly if I'm running it on a desktop instead of an uber-fast cluster.
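Concretely, the naive merge I have in mind looks something like this (a Python sketch, assuming each posting list is a sorted list of integer document ids); it has to touch every element of both lists, hence the O(N + M) cost:

    def merge_intersect(a, b):
        # Naive O(N + M) intersection of two sorted posting lists:
        # advance one pointer at a time, keep doc ids present in both.
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out

    # merge_intersect([1, 3, 5, 9], [3, 4, 9, 20]) -> [3, 9]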

So is that actually what Google is doing, with their speed due mostly to the fact that they're running this expensive computation on their massive distributed cluster?

Or is there a better algorithm that I'm not aware of? Wikipedia and Google aren't turning up anything for me.

Edit:

Since people seem to be focusing on the Google aspect of my question, I guess I'll restate it in its actual terms.

I have several very large (millions of items) indexes implemented as key/value pairs. Keys are simple words; values are Sets of documents. A common use case is to get the intersection of results from several searches on different indexes: the pain point is getting the intersection of the document sets.
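As a toy illustration of the shape of the data (the words and ids here are made up; the real indexes hold millions of entries, and the smallest-set-first ordering is just a common trick, not something built into the indexes):

    # Toy index: word -> set of document ids (values are illustrative).
    index = {
        "nanoscale":  {1, 4, 7, 9, 12},
        "tupperware": {4, 9, 20},
    }

    def search(index, *terms):
        # Intersect the document sets for every query term, starting
        # from the smallest set so later intersections stay cheap.
        sets = sorted((index[t] for t in terms), key=len)
        result = sets[0]
        for s in sets[1:]:
            result = result & s
        return result

    # search(index, "nanoscale", "tupperware") -> {4, 9}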

I can re-implement my indexes however I want - it's mostly an academic project at this point.

+1  A: 

What you're describing are called n-grams.

Google uses an algorithm called PageRank to rank the results, and that computation is implemented using MapReduce.

All of these topics have been discussed at length on Stack Overflow in the past. It should be fairly easy to look them up.

This probably doesn't help you a whole bunch, since you likely don't have an enormous distributed system to run MapReduce on, but without any details on what you're trying to index, it's difficult to suggest something that's suited to your problem.

Ben S
+1  A: 

The way you're describing it, you already have an inverted index, with a posting list for each term (the list of documents). I'm not aware of a better solution than merge joining the posting lists for each term, and to the best of my knowledge, that's what full-text indexing solutions like Lucene do. There are a couple of obvious optimisations you can make here, though:

  1. If you can store your dataset in memory, even distributed across many machines, you can merge join the result sets very quickly indeed, compared to what a disk seek per lookup would cost.
  2. The 'naive' merge join advances one pointer by a single position on each non-match. If your posting lists are themselves indexed, you can do much better: take the maximum of the current values across the lists and seek every other posting list to the first value greater than or equal to that key, possibly skipping millions of irrelevant results in one jump. This has been called a zig-zag merge join (a sketch follows below).
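This isn't Lucene's actual code, just a minimal sketch of the zig-zag idea in Python, assuming each posting list is a sorted list of integer doc ids and using binary search (bisect) in place of real skip pointers:

    from bisect import bisect_left

    def zigzag_intersect(postings):
        # Leapfrog intersection: keep a candidate doc id and, in each
        # posting list in turn, seek (binary search) to the first id >=
        # the candidate, skipping long runs of non-matching ids.
        if not postings or any(not p for p in postings):
            return []
        results = []
        positions = [0] * len(postings)   # per-list resume point for the seek
        candidate = postings[0][0]
        matched = 0                       # lists known to contain `candidate`
        i = 0
        while True:
            plist = postings[i]
            pos = bisect_left(plist, candidate, positions[i])
            if pos == len(plist):
                break                     # one list is exhausted: done
            positions[i] = pos
            if plist[pos] == candidate:
                matched += 1
                if matched == len(postings):
                    results.append(candidate)
                    candidate += 1        # move past the emitted doc id
                    matched = 0
            else:
                candidate = plist[pos]    # leapfrog to the larger id
                matched = 1
            i = (i + 1) % len(postings)
        return results

    # zigzag_intersect([[1, 3, 5, 7, 9], [3, 4, 9], [3, 9, 10]]) -> [3, 9]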
Nick Johnson