views: 855

answers: 9
I often marvel at how I can go to www.google.com, from anywhere in the world at any time, and get the returned page so fast.

Sure, they compress their output and keep to a minimal design - that helps.

But they must have millions of simultaneous hits to the box sitting on the web that DNS lists as "www.google.com".

All of you who have set up Apache or other web servers know that things are great and super fast until you start getting a few thousand simultaneous connections, let alone millions!

So, how do they do it? I guess they have a whole farm of server machines, but you'd never know it. When I went to Verizon just now, the URL was www22.verizon.com. You never see "www22.google.com", never.

Any ideas what specific technologies they use, or what technologies we non-Google mortals can use to do the same thing?

Thanks!

A: 

This is normal internet traffic handling. Google literally has entire data centers all over the planet that respond to www.google.com.

Chris Lively
+3  A: 

http://www.akamai.com

Or, translated into English (and perhaps elaborating on Chris's answer), use a content delivery network (CDN) with nodes around the world - note that these are not just data centers but actual web servers (though I'm sure most wouldn't make much of the distinction). Akamai is but one example; search Google for "content delivery network" and you're bound to find others.

You can also implement a caching strategy, though that will likely not get you quite as far. ;)
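To make the caching idea concrete, here's a minimal sketch using Python's standard http.server - purely illustrative, not how Akamai or Google actually serve pages. The point is just that the origin marks responses as cacheable, so a CDN edge node (or a browser) can serve them without coming back to you every time:

```python
# Minimal sketch (not Google's actual setup): serve a page with Cache-Control
# headers so a CDN edge node or a browser can cache it instead of hitting origin.
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>Hello from origin</body></html>"

class CachedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        # Let any shared cache (CDN) keep this response for 5 minutes.
        self.send_header("Cache-Control", "public, max-age=300")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

if __name__ == "__main__":
    HTTPServer(("", 8080), CachedHandler).serve_forever()
```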

Jason
+13  A: 

This article may be of interest to you:

Google Platform: The technological infrastructure behind Google's websites

CMS
That was fascinating
1800 INFORMATION
+1  A: 

In addition to large web farms, no doubt they're doing a lot of caching. They could cache anything from page content to frequent search terms. And caching is something that non-Google mortals can do too.
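As a rough illustration of caching frequent search terms - run_search() here is a hypothetical stand-in for whatever slow backend lookup you actually have:

```python
# Illustrative only: cache results for repeated search terms so the expensive
# backend lookup runs once per term until the entry is evicted.
from functools import lru_cache

def run_search(term: str):
    # Placeholder for the actual (slow) index lookup.
    return [f"result for {term!r}"]

@lru_cache(maxsize=10_000)          # keep results for the 10,000 most recently used terms
def cached_search(term: str):
    return run_search(term)

print(cached_search("how google scales"))   # first call hits run_search()
print(cached_search("how google scales"))   # second call is served from the cache
```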

Bullines
I seem to recall reading somewhere that Google keeps nearly all of their page listings in memory at any given point in time.
Jason Baker
Caching isn't enough - millions of hits simultaneously asking for static web pages would still bring most setups to their knees. It's more to do with DNS.
Draemon
+28  A: 

google.com, update.microsoft.com, and other services which handle astonishingly high aggregate bandwidth do much of their magic via DNS.

BGP anycast routing is used to announce the IP address of their DNS servers from multiple points around the world. Each DNS server is configured to resolve google.com to IP addresses within a data center which is geographically close. So this is the first level of load balancing, based on geography.

Next, though a DNS query for google.com will return only a small number of IP addresses, the DNS server rapidly cycles through a large range of addresses in its responses. Each client requesting google.com will get a particular answer and will be allowed to cache that answer for a while, but the next client will get a different IP address. So this is the second level of load balancing.
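You can observe that second level yourself by resolving the name a few times; depending on your resolver's caching you may or may not catch the rotation. This is just a way to watch it happen, nothing to do with how Google implements it:

```python
# Resolve www.google.com a few times and print the addresses returned.
# Depending on your resolver's caching you may or may not see the rotation
# described above.
import socket

for _ in range(5):
    infos = socket.getaddrinfo("www.google.com", 80, proto=socket.IPPROTO_TCP)
    addrs = sorted({info[4][0] for info in infos})
    print(addrs)
```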

Third, they use traditional server load balancers to map sessions arriving at a single IP address onto multiple backend servers. So this is a third level of load balancing.
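A toy model of that third level, with invented backend addresses - real load balancers are dedicated appliances or software in front of the farm, but the idea of pinning a session to one backend looks roughly like this:

```python
# Toy model of a front-end load balancer: hash the client's session key so the
# same session keeps hitting the same backend (session affinity). The backend
# addresses are invented for illustration.
import hashlib

BACKENDS = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

def pick_backend(session_id: str) -> str:
    digest = hashlib.sha1(session_id.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(BACKENDS)
    return BACKENDS[index]

print(pick_backend("client-42"))   # always the same backend for this session
```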

DGentry
+6  A: 

At the Google open house in Austin last night, Alan Eustace showed a picture of Google's data center in The Dalles, Oregon and said it was the size of approximately 3 football fields.

It's one of the newer ones, but Google has multiple data centers. It's not like each query goes to the same computer.

Even so, if you guess at how many computers Google has, and how many queries are done against Google each second, each individual server must be handling an awful lot of requests.
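A back-of-envelope calculation makes the point; every number below is a guess for illustration, not a published figure:

```python
# Back-of-envelope only: all of these numbers are assumptions.
queries_per_day = 200_000_000      # assumed front-end searches per day
fanout = 1_000                     # assumed backend machines touched per query (index shards, etc.)
servers = 100_000                  # assumed machines available to serve those requests
seconds_per_day = 86_400

frontend_qps = queries_per_day / seconds_per_day
backend_rps = frontend_qps * fanout
print(f"front-end: ~{frontend_qps:,.0f} queries/sec in aggregate")
print(f"back-end:  ~{backend_rps / servers:,.1f} requests/sec per machine (average, before peaks)")
```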

Here's some reading on how this is facilitated:

http://labs.google.com/papers/bigtable.html
http://labs.google.com/papers/gfs.html

And just http://research.google.com/ in general, lots of cool info there.

Moishe
A: 

They also have a custom web server and TCP/IP stack (along with the infrastructure), I read somewhere years ago... I doubt Apache or IIS or any other commercial/popular web server can match that...

Vyas Bharghava
+2  A: 

Moishe is right: although simply delivering static web content at Google's scale is challenging enough, it's pretty well understood and lots of other people do the same.

However, it's really the delivery of dynamic content for which Google was the trailblazer, ever since the paper that started it all: The Anatomy of a Search Engine. There are lots of clever techniques, some of which have been mentioned here, but still... Run any query on Google with terms that don't belong together - so the result won't be cached - and you'll still get a result set back in a couple of hundred milliseconds: it's absolutely incredible.

To make it even more complex, there's the new SearchWiki functionality, which adds dynamic content onto every search result, and limited personalisation of results if you're logged in.

Google have been good about opening up (to some extent) the cleverness which makes it all happen. In the end, it all boils down to architecting everything to scale well horizontally. That's how Google could keep up with the exponential growth of the Internet: just add more hardware to the BigTable, MapReduce and Google File System farms. By using lots of commodity hardware, with good infrastructure and management around it, Google could afford to keep the whole index in memory, and queries from one machine to another were quicker than going to disk.
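For anyone who hasn't seen it, the MapReduce programming model boils down to something like this single-process word count sketch; the real thing runs the same map and reduce functions across thousands of machines, which is exactly the horizontal scaling described above:

```python
# Single-process sketch of the MapReduce model: real MapReduce distributes the
# map and reduce phases across a cluster, but the programming model is the same.
from collections import defaultdict

def map_phase(doc_id, text):
    for word in text.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    return word, sum(counts)

documents = {"d1": "the web is big", "d2": "the index is big"}

# Shuffle: group intermediate (word, count) pairs by key.
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for word, count in map_phase(doc_id, text):
        grouped[word].append(count)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)   # {'the': 2, 'web': 1, 'is': 2, 'big': 2, 'index': 1}
```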

Meanwhile, Yahoo! bought bigger and bigger monolithic machines, until Sun couldn't make them big enough any more and they had to switch over to Hadoop, much too late.

Scaling the HTTP servers at Google is the easy part!

Alabaster Codify
+1  A: 

There was an excellent article on scaling HTTP services:

http://www.kegel.com/c10k.html
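The gist of that article is serving many concurrent connections from one machine with an event-driven design instead of a thread per connection. Here's a minimal sketch of that style using Python's asyncio, rather than the raw epoll/kqueue interfaces the article covers - illustrative only:

```python
# Minimal event-driven server in the spirit of the c10k article: one process,
# one event loop, many concurrent connections.
import asyncio

async def handle(reader, writer):
    await reader.read(1024)                       # read (and ignore) the request
    writer.write(b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok")
    await writer.drain()
    writer.close()

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```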

maurycy