views: 98
answers: 4

I have always wondered how a very big site like Facebook can be faster than almost any other site, despite the huge amount of data that gets stored every day.
What do they use to store their information? And if I use SQL Server to store, for example, the news feed, is that OK? (The news feed would be stored in a separate table called News.)
On the other hand, what happens if I join many huge tables with each other? Should that be slow, or does it not matter how big the tables are?

thanx :)

+1  A: 

It depends. Facebook is very fast partly because they run a huge server farm, so queries are optimised and each query can be spread across many servers.

As for huge tables, they can be fast as long as you have enough physical memory to hold the indexes for whatever you need to search on. Having the correct indexes can improve database performance hugely when it comes to retrieving data.

Joining many huge tables can be fine as long as it actually makes sense to combine them; if they're separate and unrelated, then no. If you provide more details on what kind of tables you are looking to join, we might be able to help you more.
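For example (a rough sketch only: the table, columns, and connection details are made up, and it assumes SQL Server through PHP's PDO sqlsrv driver), indexing the column you filter on keeps retrieval fast even when the table is big:

<?php
// Sketch only: table, column, and connection details are invented.
// Assumes this index already exists on the News table:
//   CREATE INDEX IX_News_DatePosted ON News (DatePosted);
$pdo = new PDO('sqlsrv:Server=localhost;Database=mysite', 'user', 'pass');

// Because DatePosted is indexed, this lookup does not scan the whole table.
$stmt = $pdo->prepare(
    'SELECT TOP 20 Id, Title, DatePosted
       FROM News
      WHERE DatePosted >= ?
      ORDER BY DatePosted DESC'
);
$stmt->execute([date('Y-m-d', strtotime('-7 days'))]);
$latest = $stmt->fetchAll(PDO::FETCH_ASSOC);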

LnDCobra
E.g. joining the USERS table with the NEWS table: each table has a very large amount of data, and they are related. And if by indexes you mean the Id column, then yes, they have the correct indexes.
Rawhi
Joining the USERS table with NEWS is NOT a good idea! What will you be searching on most? The most recent news? If so, make sure DatePosted (or similar) is indexed, and that anything you use in a WHERE clause against a big table (including temporary tables) is indexed.
LnDCobra
Yes, but why is it not a good idea to join them? In the NEWS table I have 2 columns (uId, fId), plus of course other columns, and I use a JOIN to get the name for each one.
Rawhi
So if either of these two has changed his name, I still get the current name... do you have something better?
Rawhi
Yes, because there is no data duplication: if you stored the name in the news table and ever changed any name or detail for a user, you would have to iterate through every single record in the news table and change it, which in my opinion is very bad design. In every case where I have used a news table, I have only needed to join about 20 records (which is very quick, as uId is indexed because it's the PK and an int), so the join itself isn't the time/CPU-intensive part; normally it is the filtering of news (ordering by date, text searching, etc.). See the sketch after these comments.
LnDCobra
Joining users to news means that if there is a major news item that 25,000 users want to see, you would have the same news item repeated.
Jeff O
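To make the normalised layout in these comments concrete, here is a rough sketch (the schema, column names, and connection details are invented, and it assumes SQL Server through PHP's PDO sqlsrv driver). News stores only the user ids, and the current name is picked up by the join, so renaming a user never touches the News table:

<?php
// Sketch only: schema and connection details are placeholders.
$pdo = new PDO('sqlsrv:Server=localhost;Database=mysite', 'user', 'pass');
$friendId = 42;   // whichever friend's feed we are building

$stmt = $pdo->prepare(
    'SELECT TOP 20 n.Id, n.Text, n.DatePosted, u.Name
       FROM News  AS n
       JOIN Users AS u ON u.Id = n.uId  -- int PK on Users, so the join itself is cheap
      WHERE n.fId = ?                   -- an index on News(fId, DatePosted) covers this filter
      ORDER BY n.DatePosted DESC'
);
$stmt->execute([$friendId]);
$feed = $stmt->fetchAll(PDO::FETCH_ASSOC);

Duplicating u.Name into News would avoid the join, but then every rename means updating every matching News row, which is exactly the duplication problem described above.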
A: 

It depends on what the performance bottleneck is. A common problem is using the wrong technology for the job, e.g. using a relational DB when an object DB or document store would be better, or vice versa of course.

Some people try to use the same DB for everything, which is not always the answer. Sometimes it is useful to keep multiple denormalizations of the same data for different purposes.

Thinking about the nature of the data and how it is written, read, and queried is important. You can put all write-once data in one DB and optimize that DB accordingly; data that is written frequently could be stored in a DB optimized for that pattern.

Distribution techniques can also help with scaling.

Kurt
+1  A: 

According to link text and other pages, Facebook uses a technique called sharding.

It simply uses a bunch of databases, with a small portion of the site on each database. A simple algorithm for deciding which database to use could be to take the first letter of the username as an index into the set of databases: one database for 'a', one for 'b', etc. I'm sure Facebook has a more advanced scheme than that, but the principle is the same.
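A toy version of that letter-based routing might look like the sketch below (the host names are invented placeholders, and a real scheme would also have to deal with hot letters and rebalancing):

<?php
// Toy illustration only: one shard per first letter, invented host names.
$shards = [];
foreach (range('a', 'z') as $letter) {
    $shards[$letter] = "users-$letter.db.example.com";
}

function shardHostFor(string $username, array $shards): string
{
    $letter = strtolower($username[0]);
    // Names that do not start with a-z go to a catch-all shard.
    return $shards[$letter] ?? 'users-misc.db.example.com';
}

echo shardHostFor('Rawhi', $shards);   // users-r.db.example.com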

The result is many small independent databases that are each small enough to handle the load. Facebook and all other major sites have all sorts of similar tricks to make their sites fast and responsive.

They continuously monitor the sites for performance and other metrics and come up with solutions to the issues they find.

I think the monitoring part is more important to that success than the actual techniques used to gain the performance. You cannot make a site fast by blindly throwing some "good performance spells" at it. You have to know where and why you have bottlenecks before you can remove them.

Albin Sunnanbo
+2  A: 

When you talk about scaling at the size of Facebook, it is a whole different ball park. The latest estimates put Facebook's datacenters at about 60,000 servers (sixty thousand). The cache alone is estimated at about 30 TB (terabytes), in a massive Memcached cluster. Although their back end is still MySQL, it is used as a pure key-value store, according to publicly available information:

  • Facebook uses MySQL, but primarily as a key-value persistent storage, moving joins and logic onto the web servers since optimizations are easier to perform there (on the “other side” of the Memcached layer).
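In practice that means the "join" happens in PHP on the web tier rather than in SQL. A minimal sketch of the idea (the data and field names below are invented): fetch news rows and user rows by key separately, then attach each user's name to the news rows in code:

<?php
// Sketch only: in reality both arrays would come back from the key-value
// store (via the cache), already keyed by id; the values here are invented.
$newsRows = [
    ['id' => 101, 'uId' => 7, 'text' => 'hello'],
    ['id' => 102, 'uId' => 9, 'text' => 'world'],
];
$userRows = [
    ['uId' => 7, 'name' => 'Rawhi'],
    ['uId' => 9, 'name' => 'LnDCobra'],
];

// Index users by id, then stitch the name onto each news row: this loop is
// the application-side equivalent of a SQL JOIN on uId.
$usersById = array_column($userRows, null, 'uId');
foreach ($newsRows as &$row) {
    $row['userName'] = $usersById[$row['uId']]['name'];
}
unset($row);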

There are various other technologies in use there as well.

You can also watch this year's SIGMOD 2010 keynote address, Building Facebook: Performance at big scale. They even present their basic internal API:

cache_get ($ids,
    'cache_function',
    $cache_params,
    'db_function',
    $db_params);
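One plausible reading of that call (this is a guess at the semantics, not Facebook's actual implementation) is a read-through cache: build a key per id, try Memcached first, and only call the database function for the ids that missed:

<?php
// Hedged sketch of what a cache_get()-style helper might do; the parameter
// shapes and TTL are assumptions, not the real API.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

function cache_get(array $ids, callable $cache_function, array $cache_params,
                   callable $db_function, array $db_params, Memcached $mc): array
{
    $results = [];
    $misses  = [];
    foreach ($ids as $id) {
        $key = $cache_function($id, $cache_params);   // e.g. "user:profile:42"
        $hit = $mc->get($key);
        if ($hit !== false) {
            $results[$id] = $hit;                     // served from Memcached
        } else {
            $misses[] = $id;
        }
    }
    if ($misses) {
        // Only the ids that were not cached ever reach MySQL.
        foreach ($db_function($misses, $db_params) as $id => $row) {
            $mc->set($cache_function($id, $cache_params), $row, 300);
            $results[$id] = $row;
        }
    }
    return $results;
}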

So if you connect the dots, you'll see that at such a scale you are no longer talking about a 'big database'. You are talking about huge clusters of services, key-value storage partitioned across thousands of servers, many technologies used together, and so on.

As a side note, you can also see a pretty good presentation of the MySpace internals. Although the technology stack is completely different (Microsoft .NET and SQL Server based, with a huge emphasis on message passing via Service Broker), there are similar points in how they approach storage. To sum up: application-layer partitioning.

Remus Rusanu