views: 360
answers: 5

When you search on Google (I'm almost sure AltaVista did the same thing), it says "Results 1-10 of about xxxx"...

This has always amazed me... What does "about" mean?
How can they count roughly?
I understand why they can't come up with a precise figure in a reasonable time, but how do they even reach this "approximate" one?

I'm sure there's a lot of theory behind this one that I missed...

+1  A: 

Not relevant to your question, but it reminds me of a little joke a friend of mine made when doing a simple ego-search (and don't tell me you've never Googled your name). He said something like

"Wow, about 5,000 results in just 0.22 seconds! Now, imagine how many results this is in one minute, one hour, one day!"

petr k.
Laughing out loud (SO won't let me just type LOL, grrr)
Daniel Magliola
A: 

Returning an exact number of results is not worth the overhead of calculating it accurately. Since there's not much value added by knowing there were 1,004,345 results rather than "about 1,000,000", it's more important from an end-user experience perspective to return the results quickly than to spend the additional time computing an exact total.

From Google themselves: "Google's calculation of the total number of search results is an estimate. We understand that a ballpark figure is valuable, and by providing an estimate rather than an exact count, we can return quality search results faster."

mbrevoort
I get that... My question is precisely "how" they do it.
Daniel Magliola
+1  A: 

I imagine the estimate is based on statistics. They aren't going to count all of the relevant page matches, so what they (or at least what I would) do is work out roughly what percentage of pages would match the query, based on some heuristic, and then use that as the basis for the count.

One heuristic might be to do a sample count: take a random sample of 1,000 or so pages and see what percentage match. It wouldn't take many pages in the sample to get a statistically significant answer.
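Just to make the sampling idea concrete, here's a toy sketch (the names and numbers are made up, and this is obviously not what Google actually runs): check a random sample of documents against the query and scale the hit rate up to the size of the whole corpus.

    import random

    def estimate_match_count(corpus, matches_query, sample_size=1000):
        """Estimate how many documents match by sampling, then scaling up."""
        sample = random.sample(corpus, min(sample_size, len(corpus)))
        hits = sum(1 for doc in sample if matches_query(doc))
        return round(hits / len(sample) * len(corpus))

    # Toy corpus: every 7th "page" mentions python, so there are ~14,286 real matches.
    pages = ["page %d %s" % (i, "python" if i % 7 == 0 else "ruby")
             for i in range(100000)]
    print(estimate_match_count(pages, lambda p: "python" in p))  # roughly 14,000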

1800 INFORMATION
+2  A: 

Most likely it's similar to the estimated row counts used by most SQL systems in their query planning: the number of rows in the table (known exactly as of the last time statistics were collected, but generally not up to date), multiplied by an estimated selectivity (usually based on a statistical distribution model built by sampling a small subset of rows).
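As a purely illustrative sketch of that rows-times-selectivity idea (not PostgreSQL's actual algorithm; every name here is invented): collect per-value frequencies from a small sample, then multiply the table's row count by the sampled frequency of the value being searched for.

    import random
    from collections import Counter

    def build_stats(rows, column, sample_size=300):
        """Planner-style statistics: total row count plus per-value
        frequencies observed in a small random sample."""
        sample = random.sample(rows, min(sample_size, len(rows)))
        freqs = Counter(r[column] for r in sample)
        return {"row_count": len(rows),
                "selectivity": {v: c / len(sample) for v, c in freqs.items()}}

    def estimate_rows(stats, value):
        """Estimated result size = row count * estimated selectivity."""
        sel = stats["selectivity"].get(value, 0.001)  # crude fallback for unseen values
        return round(stats["row_count"] * sel)

    # Toy table: ~25% of rows are red, so the estimate should land near 12,500.
    rows = [{"color": random.choice(["red", "green", "blue", "blue"])}
            for _ in range(50000)]
    print(estimate_rows(build_stats(rows, "color"), "red"))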

The PostgreSQL manual has a section on statistics used by the planner that is fairly informative, at least if you follow the links out to pg_stats and various other sections. I'm sure that doesn't really describe what Google does, but it at least shows one model where you could get the first N rows and an estimate of how many more there might be.

puetzk
+1  A: 

One thing that hasn't been mentioned yet is deduplication. Some search engines (I'm not sure exactly how Google in particular does it) will use heuristics to try to decide whether two different URLs contain the same (or extremely similar) content and are thus duplicate results.

If there are 156 unique URLs, but 9 of those have been marked as duplicates of other results, it is simpler to say "about 150 results" than something like "156 results, containing 147 unique results and 9 duplicates".
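A toy sketch of that idea (the content-hash fingerprinting here is just an assumption on my part; real near-duplicate detection uses fuzzier techniques like shingling or simhash): collapse results whose content fingerprints collide, then round the remaining count to an "about N" figure.

    import hashlib

    def approx_unique_count(results):
        """Collapse results whose normalized content hashes identically,
        then round the count to two significant figures."""
        seen = set()
        for url, content in results:
            seen.add(hashlib.sha1(content.strip().lower().encode()).hexdigest())
        unique = len(seen)
        magnitude = 10 ** max(len(str(unique)) - 2, 0)
        return round(unique / magnitude) * magnitude

    # 156 URLs, 10 of which share the same text (9 duplicates) -> 147 unique.
    results = [("http://example.com/%d" % i,
                "same article" if i < 10 else "page %d" % i) for i in range(156)]
    print(approx_unique_count(results))  # about 150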

carolclarinet