views:

94

answers:

3

This is a question about tradeoffs.

Imagine a social network. Each user has a status message, that he can change anytime. Whenever he does change it, all his friends are notified through a wall (like in Facebook).

To make this work. We have 3 tables Users(id, name), FriendLists(userId, friendUserId), Notifications(?).

Now let's assume each user has approximately 50 friends in his friend list. I am faced with the dilemma - how to implement Notifications table.


1st option

CREATE TABLE Notifications
(
toUserId bigint NOT NULL,
[identity] bigint IDENTITY(1,1) NOT NULL,
fromUserId bigint NOT NULL,
data varchar(256) NOT NULL,
CONSTRAINT [PK_Notifications] PRIMARY KEY CLUSTERED (toUserId, [identity])
)

Send notifications:

-- Get all friends of @fromUserId.
WITH Friends AS
   (SELECT FriendLists.friendUserId
 FROM FriendLists
 WHERE userId = @fromUserId)
-- Send updates to all friends.
SELECT
 friendUserId as toUserId,
 @fromUserId as fromUserId,
 @data as data
INTO Notifications
FROM Friends

In this case, for each status change we create 50 records (assuming 50 friends). This is bad. However the good point is that to retrieve notifications for a specific user it is really fast, since we have a clustered index on the toUserId.

2nd option

CREATE TABLE Notifications
(
toUserId bigint NOT NULL,
[identity] bigint IDENTITY(1,1) NOT NULL,
fromUserId bigint NOT NULL,
data varchar(256) NOT NULL,
CONSTRAINT [PK_Notifications] PRIMARY KEY CLUSTERED ([identity])
)
CREATE NONCLUSTERED INDEX [IX_toUserId] ON Notifications (toUserId ASC)

Send notifications:

-- Get all friends of @fromUserId.
WITH Friends AS
   (SELECT FriendLists.friendUserId
 FROM FriendLists
 WHERE userId = @fromUserId)
-- Send updates to all friends.
INSERT INTO Notifications(toUserId, fromUserId, data)
    VALUES(friendUserId, @fromUserId, @data)

Here we only insert a single record per status update. This is good. The bad point is that the retrieval of the notifications is going to be slower, since records are not clustered by toUserId.


Getting notifications is same for both methods:

SELECT TOP(50) fromUserId, [identity], data
FROM Notifications
WHERE toUserId  = @toUserId

So what is your take on this?

+1  A: 

Updates are very slow compared to SELECTs ... a few orders of magnitude. Plus, as your site scales you'll be caching all your fetches in memory, so the speed of selects will be trivial.

Jess
+1  A: 

In this situation, it seems like a bad idea to create a clustered index on (toUser,identity) because a clustered index really ought to be inserted in ascending order. Of course SQL will take care of keeping the table sorted but this comes at a high performance cost (which is the point of your question.) But in general, inserts that are known ahead of time to be in no particular order are not recommended for clustered indexes. Here is a very good three part article about clustered index recommendations.

Having said that, I'd stick with the identity column as your clustered index and create a nonclustered index on toUserId and maybe a datetime column. By including a datetime column, you can more efficiently query for recent data.

Regarding slow updates, status updates on social networking sites are a perfect situation for message queues. That way you can tune the database as needed to make reads fast and if it has an impact on write performance, the user won't have to suffer. From their perspective the update was instantaneous even though it might take a few moments to "stick".

For very large databases I'll defer to the SQL gurus who can talk about partitioning strategies (smaller more manageable tables for newer data, larger/heavily indexed tables for older data) and replication solutions.

Josh Einstein
+3  A: 

First, reads are always going to be overwhelming in comparison with writes, because each 'wall' will be seen many more times than it will be updated. So you better make reads darn fast.

Second, one of the problem inherent in these kind of big social networking sites is the distribution of data (sharding, partitioning, no single database will ever be capable of storing all accounts, all friends, all notifications) which means that when a new notification is put on a wall, the friends have to be notified on other servers. This implies updates are asynchronous and messaging based anyway.

So I would definitely go with a structure optimized for reading.

I'd recommend you go over the public presentations done by various people involved in the architecture of sites like Facebook and MySpace, like this Christa Stelzmuller's one. They explain a lot of the thinking and reasoning that goes into their design.

Remus Rusanu
Great link. Thank you. I also found this http://developer.yahoo.net/blogs/theater/archives/2008/01/nextyahoonet_big_data_viewpoints_from_the_fac.html and this http://www.infoq.com/presentations/Facebook-Software-Stack. If you have any more useful info on the topic, please share.
niaher