Does anyone know any more details about Google's web crawler (aka Googlebot)? I'm curious what it is written in (I've made a few crawlers myself and am about to make another) and whether it parses images and such. I assume it does somewhere along the line, because the images on images.google.com are all resized. It also wouldn't surprise me if it were all written in Python and if they used their own libraries for almost everything, including HTML/image/PDF parsing. Maybe they don't, though. Maybe it's all written in C/C++. Thanks in advance.

A: 

Officially allowed languages at Google, I think, are Python/C++/Java.

The bot likely uses all 3 for different tasks.

Coronatus
+1  A: 

You can find a bit about how Googlebot works here:

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=158587

For example, the "Fetch as Googlebot" tool lets you see a page as Googlebot sees it.

Probes
+1  A: 

The crawler is very likely written in C or C++; at least BackRub's crawler was written in one of these.

Be aware that the crawler only takes a snapshot of the page, then stores it in a temporary database for later processing. The indexing and other attached algorithms will extract the data, for example the image references.
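
To make the snapshot-then-process split concrete, here is a minimal Python sketch. It is not how Googlebot is actually implemented. It only illustrates the idea of storing raw pages first and extracting image references in a separate pass. The names `store_snapshot`, `extract_image_refs`, and `ImageRefParser` are hypothetical, and a plain dict stands in for the temporary database.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageRefParser(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the page URL.
                self.image_urls.append(urljoin(self.base_url, src))


def store_snapshot(store, url, raw_html):
    """Crawl stage: keep the raw page as-is, with no parsing at all."""
    store[url] = raw_html


def extract_image_refs(store):
    """Processing stage: run later over the stored snapshots."""
    refs = {}
    for url, raw_html in store.items():
        parser = ImageRefParser(url)
        parser.feed(raw_html)
        refs[url] = parser.image_urls
    return refs
```

In a crawler of your own, the fetch loop would call `store_snapshot` for each downloaded page, and a separate job would call `extract_image_refs` whenever it gets around to processing them. Keeping the two stages apart means the fetcher never blocks on parsing.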

methode