What searching algorithm/concept is used in Google?

+2  A: 

PageRank is a link analysis algorithm used by Google for the search engine, but the patent was assigned to Stanford University.

TStamper
The patent was actually registered by Stanford. The dirty little secret of PageRank is that Google doesn't own it -- Stanford does.
Deane
yeah i just noticed..it states in the link that it was assigned to Stanford
TStamper
A: 

Inverted indexes and MapReduce are the basics of most search engines (I believe). You build an index over the content and run queries against that index to find and rank relevant documents. Google, however, does much more than keep a simple index of where each word occurs: it also records how many times a word appears, where it appears, where it appears in relation to other words, the ordering, etc. Another simple concept is "stop words", which may include things like "and", "the", and so on (basically "simple" words that occur often and are generally not the focus of a query). In addition, they employ things like PageRank (mentioned by TStamper) to order pages by relevance and importance.
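To make the "more than where each word occurs" part concrete, here is a minimal sketch of a positional index that skips stop words. The stop-word list, the function name `build_positional_index`, and the sample documents are all made up for illustration; this is not how Google's index is actually built.

```python
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"and", "the", "a", "of", "to", "in"}

def build_positional_index(docs):
    """Map each term to {doc_id: [word positions]}, skipping stop words."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            if word in STOP_WORDS:
                continue  # stop words are not indexed at all
            index[word][doc_id].append(position)
    return index

# Made-up sample documents.
docs = {
    "d1": "the quick brown fox and the lazy dog",
    "d2": "the fox jumps over the fox",
}
index = build_positional_index(docs)

# Where "fox" appears in each document.
print(dict(index["fox"]))
# How many times "fox" appears in d2.
print(len(index["fox"]["d2"]))
```

Keeping positions rather than just document ids is what allows counting occurrences and checking how close two words are to each other.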

MapReduce basically takes one job, divides it into smaller jobs, and lets those smaller jobs run on many systems (partly for scalability and partly for speed). If I recall correctly, Google was able to distribute jobs to "average" computers instead of server-grade machines. Since the processing capability of a single computer is reaching a peak, many technologies are heading towards cloud computing, where a job is done by many physical machines.
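As a rough sketch of the idea, here is the classic word-count example split into map, shuffle, and reduce steps. Everything runs in a single process here; in a real MapReduce system each chunk would be handled by a different machine. The function names and the sample chunks are assumptions for illustration only.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for word in chunk.lower().split()]

def shuffle(mapped_pairs):
    """Group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # Each chunk could be mapped on a separate machine; here it is a plain loop.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Made-up input, already divided into smaller jobs.
chunks = ["google search index", "search engine index", "index"]
counts = word_count(chunks)
print(counts)
```

The map and reduce steps never share state, which is what makes it possible to spread them over many ordinary machines.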

I'm not sure how much searching Google does; it's more accurately crawling. The difference is that they start at specific points, crawl to anything reachable, and repeat until they hit some sort of dead end.
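The crawl described above can be sketched as a breadth-first walk over a link graph. To keep it runnable without network access, the "web" here is a hypothetical in-memory dictionary of made-up hosts; a real crawler would fetch pages and parse their links instead.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],               # dead end: no outgoing links
    "e.example": ["a.example"],    # not reachable from the seed below
}

def crawl(seeds, links):
    """Visit every page reachable from the seeds, each page once."""
    visited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited

order = crawl(["a.example"], LINKS)
print(order)
```

Note that `e.example` is never visited: a page that nothing links to cannot be discovered by following links alone.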

nevets1219
+3  A: 

Google's patented PigeonRank™

Wow, they initially posted this 7 years ago from Wednesday ...

Bratch
I believe this was a hoax : http://en.wikipedia.org/wiki/PigeonRank#2002:_Pigeon_Rank
TStamper
I think the wikipedia article is fake, PigeonRank(tm) is real!
CVertex
confirmed...it was an april fools joke- http://www.april-fools.us/google-pigeonrank.htm
TStamper
That link is also fake.
CVertex
From the original link (at the bottom): "Note: This page was posted for April Fool's Day - 2002." I should have placed, "Note: This page was posted for April Fool's Day - 2009" and waited 2 hours (PDT).
Bratch
lol..ok how about this forum - http://forums.digitalpoint.com/showthread.php?t=1674
TStamper
That link and any other link posted from here on is fake.
CVertex
It's april 1st where i am
CVertex
how about Jon S. come confirm this
TStamper
J.S. confirmed it before it happened.
Bratch
when..cause i want to know for myself
TStamper
http://stackoverflow.com/users/209/cvertex <- this link is a fake
Pete Kirkham
PigeonRank is not a real algorithm name
TStamper
+5  A: 

Indexing

If you want to get down to basics:

Google uses an inverted index of the Internet. What this means is that Google has an index of all pages it's crawled based on the terms in each page. For instance the term Google maps to this page, the Google home page, and the Wikipedia article for Google, amongst others.

Thus, when you go to Google and type "Google" into the search box, Google checks its index of all terms available on the Internet and finds the entry for the term "Google" and with it the list of all pages that have that term referenced in it.
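A minimal sketch of that lookup, assuming a tiny made-up set of pages (the URLs and the helper names `build_index` and `search` are hypothetical):

```python
from collections import defaultdict

def build_index(pages):
    """Map each term to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Return the pages that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

# Made-up pages standing in for crawled documents.
pages = {
    "google.example/home": "google search",
    "wiki.example/google": "google is a search company",
    "other.example": "a page about cats",
}
index = build_index(pages)

print(sorted(search(index, "google")))
print(sorted(search(index, "google company")))
```

The query never scans the pages themselves; it only looks up each term's entry in the index and intersects the resulting sets.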

For veteran users:

Google's index goes beyond your simple inverted index, however. This is why Google is the best. Google's crawlers (spiders) are smart. Very smart. Beyond just keeping track of the terms that are on any given web page, they also keep track of words that are on related pages and link those to the given document.

In other words, if a page has the term Google in it and the page has a link to or is linked from another web page, the other page may be referenced in the index under the term Google as well. All this and more go into why a given page is returned for a given query.

If you want to go into why pages are ordered the way they are in your search results, that gets into even more interesting stuff.

Ranking

To get down to basics:

Perhaps one of the most basic algorithms a search engine can use to sort your results is known as term frequency-inverse document frequency (tf-idf). Simply put, this means that your results will be ordered by the relative importance of your search terms in the document. In other words, a document that has 10 pages and lists the word Google once is not nearly as relevant as a document that has 1 page and lists the word Google ten times (that is the term frequency part). The inverse document frequency part then gives more weight to terms that appear in only a few documents than to terms that appear in nearly all of them.
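Here is a small sketch of one common tf-idf variant (relative term frequency times the natural log of total documents over documents containing the term). Many other weightings exist, and the corpus below is invented for illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """Score one term in one document against the whole corpus."""
    words = doc.lower().split()
    if not words:
        return 0.0
    # Term frequency: share of the document's words that are the term.
    tf = words.count(term) / len(words)
    # Document frequency: number of documents that contain the term.
    df = sum(1 for d in corpus if term in d.lower().split())
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)
    return tf * idf

# Made-up corpus: a short, focused document and a long one that mentions the term once.
corpus = [
    "google google google maps",
    "the history of search engines such as google and others in the early web",
    "cats and dogs",
]

ranked = sorted(corpus, key=lambda d: tf_idf("google", d, corpus), reverse=True)
for doc in ranked:
    print(round(tf_idf("google", doc, corpus), 3), doc)
```

The short document that repeats the term ranks first, the long document with a single mention ranks second, and the document without the term scores zero.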

For veteran users:

Again, Google does quite a bit more than your basic search engine when it comes to ranking results. Google has implemented the aforementioned, patented, PageRank algorithm. In short form, PageRank goes beyond tf-idf-style text matching by taking into account the popularity/importance of a given page. At this point, popularity/importance may be judged by any number of factors that Google just won't tell us. However, at the most basic of levels, Google can tell that one page is more important than another because loads and loads of other pages link to it.
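The "loads of other pages link to it" idea can be sketched with the textbook power-iteration form of PageRank. This is only the published basic formulation, not whatever Google runs in production; the damping factor of 0.85 is the commonly cited value, and the link graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    `links` maps each page to the pages it links to; every link target
    must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "hub", one also links to "a".
links = {
    "hub": [],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

The page with the most incoming links ends up with the highest score, and a link from a page counts for more when that page links to fewer others.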

dustyburwell
A: 

I think "The Anatomy of a Large-Scale Hypertextual Web Search Engine" is a little outdated. Here is a more recent talk about scalability: Challenges in Building Large-Scale Information Retrieval Systems

bill
A: 

While looking into the PageRank algorithm and similar, I was disturbed to discover that the introduction of personal search at the turn of the year (not widely commented on) seems to change quite a lot. See Failure of the Google Gold Standard and Google’s Personalized Results

mikej