views: 360
answers: 5

When you search on Google (I'm almost sure AltaVista did the same thing), it says "Results 1-10 of about xxxx"...

This has always amazed me... What does "about" mean?
How can they count roughly?
I understand why they can't come up with a precise figure in a reasonable time, but how do they even reach this "approximate" one?

I'm sure there's a lot of theory behind this one that I missed...

+1  A: 

Not relevant to your question, but it reminds me of a little joke a friend of mine made when doing a simple ego-search (and don't tell me you've never Googled your name). He said something like

"Wow, about 5,000 results in just 0.22 seconds! Now, imagine how many results this is in one minute, one hour, one day!"

petr k.
Laughing out loud (SO won't let me just type LOL, grrr)
Daniel Magliola
A: 

Returning an exact number of results is not worth the overhead of calculating it accurately. Since there's not much value added by knowing there were 1,004,345 results rather than "about 1,000,000", it's more important from an end-user experience perspective to return the results quickly than to spend the additional time computing an exact total.

From Google themselves: "Google's calculation of the total number of search results is an estimate. We understand that a ballpark figure is valuable, and by providing an estimate rather than an exact count, we can return quality search results faster."

mbrevoort
I get that... My question is precisely "how" they do it.
Daniel Magliola
+1  A: 

I imagine the estimate is based on statistics. They aren't going to count all of the relevant page matches, so what they (or at least what I would) do is work out roughly what percentage of pages would match the query, based on some heuristic, and then use that as the basis for the count.

One heuristic might be to do a sample count: take a random sample of 1,000 or so pages and see what percentage match. It wouldn't take many pages in the sample to get a statistically significant answer.
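Just to make the sampling idea concrete, here's a toy sketch (the names and numbers are made up, and this is obviously not what Google actually runs): check a random sample of documents against the query and scale the hit rate up to the size of the whole corpus.

    import random

    def estimate_match_count(corpus, matches_query, sample_size=1000):
        """Estimate how many documents match by sampling, then scaling up."""
        sample = random.sample(corpus, min(sample_size, len(corpus)))
        hits = sum(1 for doc in sample if matches_query(doc))
        return round(hits / len(sample) * len(corpus))

    # Toy corpus: every 7th "page" mentions python, so there are ~14,286 real matches.
    pages = ["page %d %s" % (i, "python" if i % 7 == 0 else "ruby")
             for i in range(100000)]
    print(estimate_match_count(pages, lambda p: "python" in p))  # roughly 14,000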

1800 INFORMATION
+2  A: 

Most likely it's similar to the estimated row counts used by most SQL systems in their query planning: the number of rows in the table (known exactly as of the last time statistics were collected, but generally not up to date), multiplied by an estimated selectivity (usually based on a statistical distribution model built by sampling a small subset of rows).
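As a purely illustrative sketch of that rows-times-selectivity idea (not PostgreSQL's actual algorithm; every name here is invented): collect per-value frequencies from a small sample, then multiply the table's row count by the sampled frequency of the value being searched for.

    import random
    from collections import Counter

    def build_stats(rows, column, sample_size=300):
        """Planner-style statistics: total row count plus per-value
        frequencies observed in a small random sample."""
        sample = random.sample(rows, min(sample_size, len(rows)))
        freqs = Counter(r[column] for r in sample)
        return {"row_count": len(rows),
                "selectivity": {v: c / len(sample) for v, c in freqs.items()}}

    def estimate_rows(stats, value):
        """Estimated result size = row count * estimated selectivity."""
        sel = stats["selectivity"].get(value, 0.001)  # crude fallback for unseen values
        return round(stats["row_count"] * sel)

    # Toy table: ~25% of rows are red, so the estimate should land near 12,500.
    rows = [{"color": random.choice(["red", "green", "blue", "blue"])}
            for _ in range(50000)]
    print(estimate_rows(build_stats(rows, "color"), "red"))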

The PostgreSQL manual has a section on statistics used by the planner that is fairly informative, at least if you follow the links out to pg_stats and various other sections. I'm sure that doesn't really describe what Google does, but it at least shows one model where you could get the first N rows and an estimate of how many more there might be.

puetzk
+1  A: 

One thing that hasn't been mentioned yet is deduplication. Some search engines (I'm not sure exactly how Google in particular does it) will use heuristics to try to decide whether two different URLs contain the same (or extremely similar) content and are thus duplicate results.

If there are 156 unique URLs, but 9 of those have been marked as duplicates of other results, it is simpler to say "about 150 results" than something like "156 results, containing 147 unique results and 9 duplicates".
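A toy sketch of that idea (the content-hash fingerprinting here is just an assumption on my part; real near-duplicate detection uses fuzzier techniques like shingling or simhash): collapse results whose content fingerprints collide, then round the remaining count to an "about N" figure.

    import hashlib

    def approx_unique_count(results):
        """Collapse results whose normalized content hashes identically,
        then round the count to two significant figures."""
        seen = set()
        for url, content in results:
            seen.add(hashlib.sha1(content.strip().lower().encode()).hexdigest())
        unique = len(seen)
        magnitude = 10 ** max(len(str(unique)) - 2, 0)
        return round(unique / magnitude) * magnitude

    # 156 URLs, 10 of which share the same text (9 duplicates) -> 147 unique.
    results = [("http://example.com/%d" % i,
                "same article" if i < 10 else "page %d" % i) for i in range(156)]
    print(approx_unique_count(results))  # about 150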

carolclarinet