views:

660

answers:

6

what is the best practice for database design for high traffic websites like this one stackoverflow?

should one must use normalize database for record keeping or normalized technique or combination of both?

is it sensible to design normalize database as main database for record keeping to reduce redundancy and at the same time maintain another denormalized form of database for fast searching?

or

main database should be denormalize and one can make normalized views in the application level for fast database operations?

or beside above mentioned approach?

what is the best practice of designing high traffic websites???

+3  A: 

Denormalizing the db to reduce the number of joins needed for intense queries is one of many different ways of scaling. Having to do fewer joins means less heavy lifting by the db, and disk is cheap.

That said, for ridiculous amounts of traffic good relational db performance can be hard to achieve. That is why many bigger sites use key value stores(e.g. memcached) and other caching mechanisms.

The Art of Capacity Planning is pretty good.

BaroqueBobcat
Disk SPACE is cheap, but disk performance most certainly is not. With a denormalized design, you often end up inserting or updating a greater volume of data on wider tables, and that often causes performance problems.
Dave Markle
True, there are trade offs with every decision. What is performant really depends on the structure of your data.
BaroqueBobcat
A: 

First: Define for yourself what hight-traffic means:

  • 50.000 Page-Viewss per day?
  • 500.000 Page-Views per day?
  • 5.000.000 Page-Views per day?
  • more?

Than calculate this down to pobalbe peak page-views per minute and per seconds. After that think about the data you want to query per page-view. Is the data cacheable? How dynamic is the data, how big is the data?

Analyze your individual requirements, program some code, do some load-testing, optimize. In most cases, before you need to scale out the database servers you need to scale out the web-servers.

Relational-database can be, if fully optimized, amazingly fast, when joining tables!

A relational-database could be hit seldom when to as a back-end, to populate a cache or fill some denormalized data tables. I would not make denomralization the default approach.

(You mentioned search, look into e.g. lucene or something similar, if you need full-text search.)

The best best-practice answer is definitely: It depends ;-)

Robert
A: 

For a project I'm working on, we've gone for the denormalized table route as we expect our major tables to have a high ratio of writes to reads (instead of all users hitting the same tables, we've denormalized them and set each "user set" to use a particular shard). You may find read http://highscalability.com/ for examples of how the "big sites" cope with the volume - Stack Overflow was recently featured.

Richy C.
+6  A: 

The performance hit of joining is frequently overestimated. Database products like Oracle are built to join very efficiently. Joins are often regarded as performing badly when the real culprit is a poor data model or a poor indexing strategy. People also forget that denormalised databases perform very badly when it comes to inserting or updating data.

The key thing to bear in mind is the type of application you're building. Most of the famous websites are not like regular enterprise applications. That's why Google, Facebook, etc don't use relational databases. There's been a lot of discussion of this topic recently, which I have blogged about.

So if you're building a website which is primarily about delivering shedloads of semi-structured content you probably don't want to be using a relational database, denormalised or otherwise. But if you're building a highly transactional website (such as an online bank) you need a design which guarantees data security and integrity, and does so well. That means a relational database in at least third normal form.

APC
A: 

Neither matters if you aren't caching properly.

Joe Chung
A: 

You can listen to a discussion on this very topic by the creators of stack overflow on thier podcast at:
http://itc.conversationsnetwork.org/shows/detail3993.html

g_g