Googlebot (Googlebot/2.1) appears to crawl the URLs of newly added sites in order of URL length, shortest first:

.. GET /ivjwiej/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..
.. GET /voeoovo/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..
.. GET /zeooviee/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..
.. GET /oveizuee/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..
.. GET /veiiziuuy/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..
.. GET /oweoivuuu/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..
.. GET /oeppwoovvw/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..
.. GET /aabieuuzii/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..

I've seen this exact pattern on multiple (>10) totally independent sites, so the ordering is not just a random coincidence.
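One way to check this claim against a real access log is sketched below. It is only a minimal sketch: the helper names (`googlebot_paths`, `is_ordered_by_length`) are made up for illustration, the sample lines are the abbreviated excerpt above rather than a full log, and matching on the substring "Googlebot" does not verify that a request really came from Google.

```python
import re

# Captures the request path from a line containing e.g. 'GET /ivjwiej/ HTTP/1.1'
REQUEST_RE = re.compile(r"GET (\S+) HTTP/")


def googlebot_paths(log_lines):
    """Return the paths of Googlebot GET requests, in the order they were logged."""
    paths = []
    for line in log_lines:
        # Naive user-agent check (assumption): any line mentioning Googlebot counts.
        if "Googlebot" not in line:
            continue
        match = REQUEST_RE.search(line)
        if match:
            paths.append(match.group(1))
    return paths


def is_ordered_by_length(paths):
    """True if each path is at least as long as the one requested before it."""
    lengths = [len(p) for p in paths]
    return all(a <= b for a, b in zip(lengths, lengths[1:]))


# The abbreviated excerpt from the question, used as sample input.
SAMPLE = [
    '.. GET /ivjwiej/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..',
    '.. GET /voeoovo/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..',
    '.. GET /zeooviee/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..',
    '.. GET /oveizuee/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..',
    '.. GET /veiiziuuy/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..',
    '.. GET /oweoivuuu/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..',
    '.. GET /oeppwoovvw/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..',
    '.. GET /aabieuuzii/ HTTP/1.1" 200 .. "Mozilla/5.0 (compatible; Googlebot/ ..',
]

sample_paths = googlebot_paths(SAMPLE)
sample_is_ordered = is_ordered_by_length(sample_paths)
```

Running the same check over complete logs from several sites would show whether the ordering holds beyond a short excerpt.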

Just to avoid confusion: the crawling order can seem like a very minor detail in how Googlebot operates. And yes, it really is a minor detail, but I nevertheless want to understand the technical details of how Googlebot crawls the net, and the crawling order is one such detail. If you believe that this piece of knowledge is "useless", that is totally fine, but please don't pollute this page with answers, since your contribution won't be very helpful. Answers that are not helpful will be downvoted in accordance with the SO house rules.

My questions are:

  1. Have you (yes, you personally - not a blog you read, etc.) observed this crawling pattern?
  2. Is the crawling pattern officially documented by Google?
  3. What could be the reasons behind choosing this crawling pattern?

Please try to address all three (3) questions.

A: 
  1. No, I haven't.
  2. No.
  3. Although this behavior seems really unusual, I think it could be the consequence of a bunch of coincidences rather than a crawling pattern. Unfortunately I would need more data (e.g. the real access log) before making any assertions. Possible causes to rule out: 1. Are the URLs listed in a sitemap? 2. Are the URLs listed in alphabetical order? 3. In which order do the URLs usually appear on a page?
Simone Carletti
weppos: I've seen it on multiple (>10) totally independent sites, so I'm 100 % certain that it is a predictable pattern and not just a random coincidence. Answers to your questions: 1.) No, 2.) No, 3.) Randomly.
knorv
Thanks! Based on your information, this is likely to be a design decision. I'll try to run some tests. :)
Simone Carletti
weppos: Great! Please let us know if you find anything useful!
knorv
+3  A: 

From a web-development perspective this non-random crawling pattern can have unexpected consequences, such as non-random load patterns if one specific URL length corresponds to one type of particularly heavy transaction, etc.

if you have transaction pages accessible to search engine bots, then i call it fail. search engine bots shouldn't have any access to the transaction pages whatsoever! either forbid indexing it in robots.txt or on page in meta robots.
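a rough sketch of checking such a robots.txt rule with python's standard `urllib.robotparser` module. the `/checkout/` path and the example.com URLs are made-up placeholders, not taken from the question, and this only shows how the rule is interpreted by that parser, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep all crawlers out of a transaction area.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
"""

parser = RobotFileParser()
# parse() takes the file content as a list of lines; no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URLs (assumptions) used only to exercise the rule.
blocked = parser.can_fetch("Googlebot", "https://example.com/checkout/step2")
allowed = parser.can_fetch("Googlebot", "https://example.com/ivjwiej/")

print(blocked)  # False: the path falls under the Disallow rule
print(allowed)  # True: no rule matches this path
```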

your three questions are thus useless - google doesn't document any algorithm they use. moreover, order of crawling is completely useless to know (or try to manipulate), since basically you don't care and want to get as many pages indexed as possible (except those you forbid in robots.txt).

dusoft
you can't vote down as much as you want, your question is still fail and you should rather listen to my recommendations here. (or take the possible consequences from google)
dusoft
What makes you believe that I have downvoted your answer?
knorv
He's totally right, transaction pages should not be included in a crawl. Why would you possibly want someone to land on your website halfway through a transaction anyway?
Nat Ryall
knorv: i didn't mean you in particular, just anyone downvoting in general...
dusoft
Kelix: Transaction was the wrong word, it should have been "particularly database heavy page". With that wording does the point get across?
knorv
ok, then i am sorry, but transaction means middle-in-the-process page.
dusoft
+1  A: 
  1. No
  2. No
  3. I don't believe that the crawling pattern actually matters. If the sequence in which Google finds your pages matters to your content, or even causes errors when pages are accessed in the wrong order, then you have something seriously wrong with your site structure (or with your robots meta tags/robots.txt).

What I could observe in my projects was that Google tends to crawl pages just in the way the bot finds them. And this in turn depends on the way you 'present' them to Google (by means of links to the site, a sitemap, an rss feed etc.)

So I wouldn't worry too much about URL lengths; rather, place a link to the pages you want to be found on a prominent, regularly crawled page.

msparer
It is not that I "worry too much about URL lengths", I just want to know the mechanics of the Googlebot. Knowing dominates not knowing.
knorv
+1  A: 

I haven't experienced anything like this (though I never keep track of exactly which URLs are indexed and when). In my experience, Google indexes the URLs it considers most popular first. For example if it sees a link from a high-ranking page or from many pages, it will crawl that before others on the same site.

The only reasoning I can think of for your case is that Googlebot assumes longer URLs equate to a 'deeper' page, but ignoring the folder structure.

DisgruntledGoat
A: 

3: Maybe Googlebot stores URL strings in a tree data structure. The first URL, which is the shortest, is the tree root, so the "endings" of subsequent URLs are only appended as tree leaves. This would be more efficient than storing each URL as a separate string (for example in cases like /lang_english/ /lang_italian/ /lang_german/).
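To illustrate the storage argument only, here is a minimal sketch of a character-level prefix tree (trie) holding the three example URLs. The `UrlTrie` class is hypothetical and written for this answer; nothing here is known about how Googlebot actually stores URLs.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # next character -> child node
        self.is_url_end = False   # True if a stored URL ends at this node


class UrlTrie:
    """Hypothetical character-level trie; URLs with a common prefix share nodes."""

    def __init__(self):
        self.root = TrieNode()
        self.node_count = 0  # number of character nodes, excluding the root

    def insert(self, url):
        node = self.root
        for ch in url:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        node.is_url_end = True

    def __contains__(self, url):
        node = self.root
        for ch in url:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_url_end


urls = ["/lang_english/", "/lang_italian/", "/lang_german/"]
trie = UrlTrie()
for url in urls:
    trie.insert(url)

# Characters needed when every URL is stored as its own string.
total_chars = sum(len(u) for u in urls)

print(total_chars, trie.node_count)
```

The shared prefix `/lang_` is stored once in the trie, so the node count comes out lower than the total character count of the separate strings.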

anonymous_geek