I am looking for documents on how Google crawls and indexes content. I have read many "light" papers and articles on how to improve your ranking and make sure your content is properly indexed, but I am looking for more advanced technical documents on how the crawling and indexing actually work.

The things I would like to know more about:

  • What elements does Google look for when it crawls: page content, URL format, keywords, description, etc.?
  • How is the index updated?

Basically, I am trying to understand why some pages are indexed but not others, even when their formats are similar. Why do only 10% of my site's pages appear when I search on the entire domain, even though my server logs show that Google crawled every single link?

A: 

MapReduce: Simplified Data Processing on Large Clusters

Mitch Wheat
That paper is not about crawling and indexing the web. It is about handling large data sets and computation in general.
Laurent Luce
@Laurent Luce: ...and that's how Google crawls the web!
Mitch Wheat
This document doesn't explain how the content is crawled and indexed.
Laurent Luce
I added more details to the question to explain why this type of document is not relevant here.
Laurent Luce
@Laurent Luce: Google aren't going to tell you the precise details of how they index, now are they?
Mitch Wheat
+3  A: 

The answers to both things are closely-guarded trade secrets, ostensibly to prevent gaming the system.

Also keep in mind that Google makes over 400 algorithmic changes per year, making it close to impossible for an outsider to be accurate and up-to-date. Short of working for Google, you're likely not going to find an in-depth and accurate answer.

However, Matt Cutts, head of the web spam team, frequently provides the most accurate insights into how Google handles content, both on his blog and on the GoogleWebmasterHelp YouTube channel. It's worth going through his content to get a much better understanding of Google's methodology.

Mark Trapp
+1, although Page and Brin were nice enough to provide the paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" on their Stanford page. I guess that's the best you're going to get from Google. http://infolab.stanford.edu/~backrub/google.html
EnderMB
I have been reading the blog and it is very interesting indeed.
Laurent Luce
+1  A: 

For a technical view of how a web crawler works, I suggest taking a close look at the Nutch project (nutch.apache.org).

A typical web crawler has four parts: a fetcher, a parser, an indexer, and a searcher. Briefly, the crawler fetches every URL it can reach on a website and creates segments in which it stores up to 101 KB per page. Those pages are then parsed: common words such as "and", "or", and "the" are not stored, while the remaining words are analyzed using Bayesian calculations to produce a rank.
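To make the parse step concrete, here is a minimal sketch in Python. It is not how Nutch or Google actually does it: the `PageParser` class, the `parse_page` helper, and the three-word stop list are made up for illustration, and the size cap simply reuses the 101 KB figure mentioned above. It only shows the idea of truncating a page, collecting its links, and dropping stop words.

```python
from html.parser import HTMLParser

# Illustrative assumptions: a real crawler uses a much larger stop-word list.
STOP_WORDS = {"and", "or", "the"}
MAX_BYTES = 101 * 1024  # per-page cap taken from the description above


class PageParser(HTMLParser):
    """Collects outgoing links and non-stop words from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        # Record the target of every <a href="..."> for the fetcher to follow.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Lowercase, strip surrounding punctuation, and skip stop words.
        for word in data.lower().split():
            word = word.strip(".,;:!?\"'()")
            if word and word not in STOP_WORDS:
                self.words.append(word)


def parse_page(raw_html: bytes):
    """Truncate the page to the cap, then return (links, words)."""
    text = raw_html[:MAX_BYTES].decode("utf-8", errors="ignore")
    parser = PageParser()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return parser.links, parser.words


links, words = parse_page(
    b'<p>The cat and the dog.</p><a href="/next">Next or last</a>'
)
print(links)
print(words)
```

The links would go back to the fetcher's queue, and the words would go on to the indexer.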

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. This is mainly done by storing a list of occurrences of each search term in an inverted index, typically implemented as a hash table or binary tree.
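As a rough illustration of an inverted index, the sketch below uses a Python dict (a hash table) that maps each term to the documents and positions where it occurs. The function names and the three sample pages are assumptions made for this example; a production index would also handle stemming, ranking, and storage on disk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} across all documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical sample documents.
docs = {
    "page1": "google crawls the web",
    "page2": "the index is updated often",
    "page3": "google updates the index",
}
index = build_inverted_index(docs)
print(search(index, "google index"))
```

Looking up a term is a single dictionary access, which is why this layout makes queries fast even when the document collection is large.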

As Mark stated, Google's calculations are mostly trade secrets, but the patents issued to Google could be a good start. PageRank (http://en.wikipedia.org/wiki/PageRank) mainly analyzes backlinks and the importance of the websites that point to your site. In my experience, it is important to provide an XML sitemap listing all the pages on your site. In that sitemap you can also suggest a crawl frequency for each page. gsitecrawler.com/ is an interesting option for generating one.
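To give a feel for how backlinks turn into a score, here is a small sketch of the textbook PageRank iteration described in the Wikipedia article. It is a simplified formulation, not Google's actual ranking: the damping factor of 0.85, the iteration count, and the four-page link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: "c" receives links from three other pages.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph the page with the most inbound links ends up with the highest score, and a page nobody links to keeps only the small baseline share.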

Google Webmaster Tools will show you what Google finds on your site. Server logs are fine, but the robot may be running into problems, and the best way to find out is to check the crawl errors reported there.

Finally, most of your concerns are what SEO specialists live for. I suggest checking sites like seomoz.com and their tools; you will learn how to position your website better in organic search results.

Hope it helps! Sebastian.

sebastian_h