This question is meant to serve as a list of the databases, and their configurations, that the major web-sites use. It should be a useful reference for anyone thinking of scaling their web-site to the size of Twitter, Facebook or even Google.

Please keep your answers brief and be sure to cite any sources used.

EDIT:

Also, please bold both the web-site name and the database for easier scanning.

+3  A: 

Digg.com

  • MySQL (Relational Database) for scaling out reads
  • MemcacheDB (Key-Value Store) for scaling out writes

Both data stores are distributed across multiple servers.
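The sources above don't say how Digg decides which server holds which record, so the following is only a rough sketch of one common way to spread keys across several servers: hash the key and take it modulo the server count. The server names and the `server_for` helper are made up for illustration.

```python
import hashlib

# Hypothetical server names; Digg's actual hosts are not given in this answer.
SERVERS = ["kv-0", "kv-1", "kv-2", "kv-3"]


def server_for(key, servers=SERVERS):
    """Map a key to one server using a stable hash (simple modulo sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


if __name__ == "__main__":
    for key in ("user:1", "user:2", "story:99"):
        print(key, "->", server_for(key))
```

Note that plain modulo sharding remaps most keys when a server is added or removed; consistent hashing is the usual way around that, but it is left out here to keep the sketch short.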

Digg stats:

  • 30M users
  • 26M uniques per month
  • 2 billion requests a month
  • 13,000 requests a second, peaking at 27,000 requests a second

Sources:

niktech
Digg recently migrated its "green badge" feature to Cassandra: http://blog.digg.com/?p=966
Vladimir
+4  A: 

LinkedIn.com

  • Oracle (Relational Database)
  • MySQL (Relational Database)

Databases are replicated on multiple servers for high availability. Each service uses its own domain-specific database.

LinkedIn stats:

  • 22 million members
  • 4+ million unique visitors/month
  • 40 million page views/day
  • 2 million searches/day

Sources:

niktech
+3  A: 

Microsoft.com

  • SQL Server (no surprise there)

Microsoft.com stats:

  • 250 million unique visits/month.
  • 70 million page views/day.
  • 15,000 connections/second.
  • Maintains an average of 35,000 concurrent connections to a total of 80 Web servers.

Sources:

Fredrik Mörk
+1  A: 

Google uses BigTable: http://labs.google.com/papers/bigtable.html

stribika
+6  A: 

Facebook.com

  • Hive (Data warehouse for Hadoop, supports tables and a variant of SQL called HiveQL). Used for "simple summarization jobs, business intelligence and machine learning and many other applications".
  • Cassandra (Multi-dimensional, distributed key-value store). Currently used for Facebook's private messaging.

Currently running 610 (soon to be 1000) Hadoop nodes in a single cluster with a Hive datastore. Both Hive and Cassandra have been open-sourced by Facebook.

Facebook stats:

  • More than 200 million active users
  • More than 100 million users log on to Facebook at least once each day
  • More than 30 million users update their statuses at least once each day
  • Average user has 120 friends on the site

Sources:

niktech
+1  A: 

Twitter.com

  • MySQL (Relational Database).
  • Cassandra (Multi-dimensional, distributed key-value store). Adoption is at an early stage: the second source describes it as just "beginning to use Cassandra at Twitter".

In May 2008, Twitter had 1 MySQL instance for writes with multiple MySQL slave instances for reads.
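For anyone unfamiliar with this layout, here is a minimal sketch of the routing idea: writes go to the single primary, reads are rotated across the replicas. It is not Twitter's code; the `ReadWriteRouter` class and the host names are assumptions for illustration, and a real setup also has to deal with replication lag and transactions.

```python
import itertools


class ReadWriteRouter:
    """Send SELECTs to replicas (round-robin) and everything else to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # cycle() rotates through the replicas forever; None means "no replicas".
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, sql):
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ""
        if verb == "SELECT" and self._replicas is not None:
            return next(self._replicas)
        return self.primary


if __name__ == "__main__":
    # Hypothetical host names.
    router = ReadWriteRouter("mysql-primary", ["mysql-replica-1", "mysql-replica-2"])
    print(router.route("INSERT INTO statuses (body) VALUES ('hello')"))
    print(router.route("SELECT * FROM statuses LIMIT 20"))
```

Classifying a statement by its first keyword is deliberately naive; it is only there to show where the read/write split happens.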

Twitter stats:

  • Total Users: 1+ million
  • Total Active Users: 200,000 per week
  • Total Twitter Messages: 3 million/day
  • 5% of Twitter users account for 75% of all activity
  • 72.5% of all users joined during the first five months of 2009

Sources:

niktech
+1  A: 

PlentyOfFish.com uses Microsoft SQL Server:

http://www.codinghorror.com/blog/archives/001279.html

duffymo
+2  A: 

Yahoo.com

  • PostgreSQL (modified) - A client can connect to any of the nodes in the cluster (or to a policy-restricted subset). A query flows from the client to the server it chose to connect to. The SQL compiler on that node compiles and optimizes the query on that single node (no parallelism).
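As a rough illustration of the "any node, or a policy-restricted subset" part, a client-side node picker could look like the sketch below. This is not Yahoo's implementation; `choose_node`, the node names and the idea of expressing the policy as a set are all assumptions.

```python
import random


def choose_node(nodes, allowed=None, rng=random):
    """Pick one node to connect to, optionally limited to a policy subset."""
    # With no policy, every node in the cluster is a candidate.
    candidates = [n for n in nodes if allowed is None or n in allowed]
    if not candidates:
        raise ValueError("no node is permitted by the policy")
    return rng.choice(candidates)


if __name__ == "__main__":
    cluster = ["node-a", "node-b", "node-c"]  # hypothetical node names
    print(choose_node(cluster))
    print(choose_node(cluster, allowed={"node-b", "node-c"}))
```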

Yahoo.com stats:

  • 24 billion events a day
  • 2-petabyte database, claimed to be the largest (Mar 2008)

Source:

KahWee Teng
2 petabytes of what? :)
Skurmedel
It's described as "structured data, as opposed to unstructured data like e-mail and other documents." Hassan, the Data VP, added, "It's about how people use our Web site, both from the advertising perspective and from the consumer experience perspective."
KahWee Teng
+1  A: 

Flickr uses MySQL.

YouTube uses MySQL, but they are moving to Google's BigTable.

MySpace uses SQL Server.

Wikipedia uses MySQL.

Mohammed Nasman
+6  A: 

Stack Overflow - SQL Server

Pandiya Chendur