views:

967

answers:

4

social networking website probably maintain tables for users, friends and events.

how do they use these table to compute friends events in an efficient and scalable manner?

A: 

For the small scale doing a join on users.friends and users.events and query caching is probably fine but does slow down pretty quickly as friends and events grow. You could also try an event based model in which every time a user creates an event an entry is created in a join table (perhaps called "friends_events"). Thus whenever a user wants to see what events their friends have created they can simply do a join between their own id and the friends_events table and find out. In this way you avoid grabbing all a users with friends and then joining their friends with the events table.

Eric Scrivner
+1  A: 

The mainstay data structure of social networking sites is the graph. On facebook the graph is undirected (When you're someone's friend, they're you're friend). On twitter the graph is directed (You follow someone, but they don't necessarily follow you).

The two popular ways to represent graphs are adjacency lists and adjacency matrices.

An adjacency list is simply a list of edges on the graph. Consider a user with an integer userid.

User1, User2
  1      2
  1      3
  2      3

The undirected interpretation of these records is that user 1 is friends with users 2 and 3 and user 2 is also friends with user 3.

Representing this in a database table is trivial. It is the many to many relationship join table that we are familiar with. SQL queries to find friends of a particular user are quite easy to write.

Now that you know a particular user's friends, you just need to join those results to the updates table. This table contains all the user's updates indexed by user id.

As long as all these tables are properly indexed, you'd have a pretty easy time designing efficient queries to answer the questions you're interested in.

Jason Punyon
+9  A: 

Many of the social networking sites like Twitter don't use an RDBMS at all but a Message Queue application. A lot of them start out with a already present application like RabbitMQ. Some of them get big enough they have to heavily customize or build their own. Twitter is in the process of doing this for the second time.

A message queue application works by holding messages from one service for one or more other services. For instance say service Frank is publishing messages to a queue foo. Joe and Jill are subscribed to Franks foo queue. the application will keep track of whether or not Joe or Jill have recieved the messages and once every subscriber to the queue has recieved the message it discards it. Frank fires messages and forgets about it. Joe and Jill ask for messages from foo and get whatever messages they haven't gotten yet. Joe and Jill do whatever they need to do with the message. Perhaps keeping it around perhaps not.

The message queue application guarantees that everyone who is supposed to get the message can and will get the message when they request them. The publisher can send the messages confident that subscriber can get them eventually. This has the benefit of being completely asynchronous and not requiring costly joins.

EDIT: I should mention also that usually the storage for these kind of things at high scale are heavily denormalized. So Joe and Jill may be storing a copy of the exact same message. This is considered ok because it helps the application scale to billions of users.

Other reading:

  1. http://www.rabbitmq.com/
  2. http://qpid.apache.org/
Jeremy Wall
+1  A: 

Travis wrote a great post on this ,

Activity Logs and Friend Feeds on Rails & pfeed http://www.travisdunn.com/activity-logs-and-friend-feeds-on-rails-pfeed