Hi,

I'm interested in hearing your opinions on the best way to implement a social activity stream (Facebook is the most famous example). The problems/challenges involved are:

  • Different types of activities (posting, commenting ..)
  • Different types of objects (post, comment, photo ..)
  • 1-n users involved in different roles ("User x replied to User y's comment on User z's post")
  • Different views of the same activity item ("you commented .." vs. "your friend x commented" vs. "user x commented .." => 3 representations of a "comment" activity)

.. and some more, especially if you take it to a high level of sophistication, as Facebook does, for example combining several activity items into one ("users x, y and z commented on that photo").

Any thoughts or pointers on patterns, papers, etc on the most flexible, efficient and powerful approaches to implementing such a system, data model, etc. would be appreciated.

Although most of the issues are platform-agnostic, chances are I'll end up implementing such a system on Ruby on Rails.

+2  A: 

I think Plurk's approach is interesting: they supply your entire timeline in a format that looks a lot like Google Finance's stock charts.

It may be worth looking at Ning to see how a social networking platform works. The developer pages look especially helpful.

warren
+7  A: 

The biggest issues with event streams are visibility and performance; you need to restrict the events displayed to be only the interesting ones for that particular user, and you need to keep the amount of time it takes to sort through and identify those events manageable. I've built a smallish social network; I found that at small scales, keeping an "events" table in a database works, but that it gets to be a performance problem under moderate load.

With a larger stream of messages and users, it's probably best to go with a messaging system, where events are sent as messages to individual profiles. This means you can't subscribe to people's event streams and see previous events as easily, but you are simply rendering a small group of messages whenever you need to render the stream for a particular user.
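
To illustrate the messaging idea, here is a minimal Ruby sketch of fan-out-on-write, with the follower lookup and the per-user inbox stubbed out as in-memory placeholders (all names are mine, not a reference implementation):

# Per-user "inbox" of messages; in production this would be a message
# queue, a key-value store, or a table keyed by recipient.
INBOXES = Hash.new { |h, user_id| h[user_id] = [] }

def followers_of(user_id)
  # Placeholder: a real system would query the follower graph here.
  [2, 3, 4]
end

# Fan-out on write: copy the event into every follower's inbox so that
# reading a stream never has to scan or filter a global events table.
def publish_event(actor_id, type, data)
  event = { actor_id: actor_id, type: type, data: data, at: Time.now.utc }
  followers_of(actor_id).each { |uid| INBOXES[uid].unshift(event) }
end

# Rendering a user's stream is then just a cheap slice of their inbox.
def stream_for(user_id, limit = 20)
  INBOXES[user_id].first(limit)
end

publish_event(1, :comment, post_id: 42)
p stream_for(2)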

I believe this was Twitter's original design flaw: I remember reading that they were hitting the database to pull in and filter their events. This had everything to do with architecture and nothing to do with Rails, which (unfortunately) gave birth to the "Ruby doesn't scale" meme. I recently saw a presentation where the developer used Amazon's Simple Queue Service as the messaging backend for a Twitter-like application, which would have far higher scaling capabilities; it may be worth looking into SQS as part of your system if your loads are high enough.
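
For reference, pushing events through SQS from Ruby looks roughly like this with the current aws-sdk-sqs gem (which postdates this thread); the queue name, region and message fields are made up for illustration:

require 'aws-sdk-sqs'
require 'json'

sqs = Aws::SQS::Client.new(region: 'us-east-1')
queue_url = sqs.get_queue_url(queue_name: 'activity-events').queue_url

# Producer: publish an activity as a message instead of writing it
# straight into a shared events table.
sqs.send_message(
  queue_url: queue_url,
  message_body: { actor_id: 1, type: 'comment', post_id: 42 }.to_json
)

# Consumer: poll for messages and fan them out to follower streams.
sqs.receive_message(queue_url: queue_url, max_number_of_messages: 10).messages.each do |msg|
  event = JSON.parse(msg.body)
  # ... deliver the event to each follower's stream here ...
  sqs.delete_message(queue_url: queue_url, receipt_handle: msg.receipt_handle)
end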

Tim Howland
Tim, do you by any chance remember the name of the presentation or the presenter?
Danita
It was at O'Reilly and Associates' Ignite Boston presentation, either number 3 or 4. I believe the presenter had a book on scaling RoR with O'Reilly. Sorry I can't be more specific!
Tim Howland
Thanks Tim :) By the way, what did you mean by "smallish social network"? How many users, or active users at a given time?
Danita
In case anyone needs it, I think this is the presentation Tim is talking about: "Dan Chak -- Scaling to the Size of your Problems" http://radar.oreilly.com/2008/09/ignite-boston-4----videos-uplo.html
Danita
Smallish in this case is such that "select * from events where event is visible to this user" returns a result in less than a second or two; figure a few hundred thousand rows' worth of events.
Tim Howland
+21  A: 

I have created such a system, and I took this approach:

Database table with the following columns: id, userId, type, data, time.

  • userId is the user who generated the activity
  • type is the type of the activity (e.g. wrote blog post, added photo, commented on a user's photo)
  • data is a serialized object with meta-data for the activity where you can put in whatever you want

This limits the searches/lookups you can do in the feeds to users, time and activity types, but in a Facebook-type activity feed this isn't really limiting. And with the correct indexes on the table, the lookups are fast.

With this design you would have to decide what metadata each type of event should require. For example a feed activity for a new photo could look something like this:

{"id": 1, "userId": 1, "type": "PHOTO", "time": "2008-10-15 12:00:00", "data": {"photoId": 2089, "photoName": "A trip to the beach"}}

You can see that although the name of the photo is certainly stored in some other table containing the photos, and I could retrieve it from there, I duplicate the name in the metadata field, because you don't want to do any joins on other database tables if you want speed. And in order to display, say, 200 different events from 50 different users, you need speed.
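
Since the question mentions Rails, here is a rough sketch of what this design could look like as an ActiveRecord model inside a Rails app with a feed_activities table. The model, column and helper names are my own, not the answerer's code; note that a column literally named "type" would clash with Rails' single-table inheritance, so it is renamed here.

require 'json'

# Assumed table: feed_activities(id, user_id, activity_type, data, time)
class FeedActivity < ActiveRecord::Base
  # The data column holds the denormalized metadata as a JSON string, so
  # rendering a feed never needs to join the photos/posts/comments tables.
  def metadata
    JSON.parse(data || '{}')
  end
end

# Writing the activity when a photo is uploaded:
FeedActivity.create!(
  user_id: 1,
  activity_type: 'PHOTO',
  time: Time.now,
  data: { photoId: 2089, photoName: 'A trip to the beach' }.to_json
)

# Reading a feed: filter only on user ids, type and time, no joins.
friend_ids = [1, 2, 3]  # normally the viewer's friend list
FeedActivity.where(user_id: friend_ids, activity_type: %w[PHOTO COMMENT])
            .order(time: :desc)
            .limit(200)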

Then I have classes that extend a basic FeedActivity class for rendering the different types of activity entries. Grouping of events is also built into the rendering code, to keep that complexity out of the database.
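
A bare-bones Ruby sketch of that rendering layer might look like this (the subclass names and output strings are purely illustrative):

# Base class: one subclass per activity type knows how to turn the
# stored metadata into a line of feed text.
class FeedActivityRenderer
  def initialize(metadata)
    @metadata = metadata
  end

  def render
    raise NotImplementedError, 'subclasses must implement render'
  end
end

class PhotoActivityRenderer < FeedActivityRenderer
  def render
    "uploaded the photo \"#{@metadata['photoName']}\""
  end
end

class CommentActivityRenderer < FeedActivityRenderer
  def render
    'commented on a photo'
  end
end

puts PhotoActivityRenderer.new('photoName' => 'A trip to the beach').render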

heyman
This is a really great system. I assume that you are creating the feed database entries at the same time you actually perform the action, for example creating a new comment event entry in the feed table at the same time the user submits the comment.
Ryan Max
Yep, that's correct. Lately I've been using MongoDB (http://mongodb.org) in a few projects, and its schemaless approach makes it very suitable for creating a well-performing social activity stream that follows this design.
heyman
Wait, but you have userID:1, you'll still need a join to grab the user name?
AnApprentice
AnApprentice: Yep, you might want to throw in a username field as well. In our system, we only displayed events generated by a user's friends, and I believe we already had a map of the friends' userid->username in memory, so looking up the usernames didn't require a JOIN and was fast.
heyman
+2  A: 
-- one entry per actual event
CREATE TABLE events (
  id        INTEGER PRIMARY KEY,
  timestamp DATETIME NOT NULL,
  type      VARCHAR(32) NOT NULL,
  data      TEXT
);

-- one entry per event, per feed containing that event
CREATE TABLE events_feeds (
  event_id INTEGER NOT NULL REFERENCES events (id),
  feed_id  INTEGER NOT NULL
);

When the event is created, decide which feeds it appears in and add those to events_feeds. 
To get a feed, select from events_feeds, join in events, order by timestamp.
Filtering and aggregation can then be done on the results of that query.
With this model, you can change the event properties after creation with no extra work.
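
In ActiveRecord terms, the write-time fan-out and the feed query could look roughly like this (the model and method names are my own, mapped onto the two tables above):

class Event < ActiveRecord::Base
  has_many :events_feeds, class_name: 'EventsFeed'
end

class EventsFeed < ActiveRecord::Base
  self.table_name = 'events_feeds'
  belongs_to :event
end

# Write time: decide which feeds the event appears in and record it once per feed.
def publish(event, feed_ids)
  feed_ids.each { |fid| EventsFeed.create!(event: event, feed_id: fid) }
end

# Read time: select from events_feeds, join in events, order by timestamp.
def feed_events(feed_id, limit = 50)
  Event.joins(:events_feeds)
       .where(events_feeds: { feed_id: feed_id })
       .order(timestamp: :desc)
       .limit(limit)
end
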
jedediah
+2  A: 

I had a similar approach to that of heyman: a denormalized table containing all of the data that would be displayed in a given activity stream. It works fine for a small site with limited activity.

As mentioned above, it is likely to face scalability issues as the site grows. Personally, I am not worried about the scaling issues right now. I'll worry about that at a later time.

Facebook has obviously done a great job of scaling, so I would recommend reading their engineering blog, as it has a ton of great content: http://www.facebook.com/notes.php?id=9445547199

I have been looking into better solutions than the denormalized table I mentioned above. Another way I have found of accomplishing this is to condense all the content that would be in a given activity stream into a single row. It could be stored in XML, JSON, or some serialized format that could be read by your application. The update process would be simple too. Upon activity, place the new activity into a queue (perhaps using Amazon SQS or something else) and then continually poll the queue for the next item. Grab that item, parse it, and place its contents in the appropriate feed object stored in the database.

The good thing about this method is that you only need to read a single row whenever that particular feed is requested, rather than querying a series of tables. It also lets you maintain a finite list of activities, since you can pop off the oldest activity item whenever you update the list.
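
A toy Ruby sketch of that flow, with a Thread::Queue standing in for SQS and a hash standing in for the per-user feed rows (all names invented for illustration):

require 'json'

MAX_FEED_ITEMS = 100
feed_rows = Hash.new { |h, user_id| h[user_id] = [] }  # one serialized row per user
queue = Queue.new                                      # stand-in for SQS

# On activity: enqueue the new item rather than touching the feed directly.
queue << { 'user_id' => 2, 'text' => 'User 1 commented on your photo' }.to_json

# Worker: poll the queue, parse each item and fold it into the feed row.
until queue.empty?
  item = JSON.parse(queue.pop)
  feed = feed_rows[item['user_id']]
  feed.unshift(item)
  feed.pop while feed.size > MAX_FEED_ITEMS  # keep the activity list finite
end

# This JSON blob is what would be stored back into the single feed row.
puts feed_rows[2].to_json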

Hope this helps! :)

+3  A: 

If you do decide that you're going to implement in Rails, perhaps you will find the following plugin useful:

ActivityStreams: http://github.com/face/activity_streams/tree/master

If nothing else, you'll get to look at an implementation, both in terms of the data model and the API provided for pushing and pulling activities.

Alderete
+1  A: 

I started to implement a system like this yesterday; here's where I've got to...

I created a StreamEvent class with the properties Id, ActorId, TypeId, Date, ObjectId and a hashtable of additional Details key/value pairs. This is represented in the database by a StreamEvent table (Id, ActorId, TypeId, Date, ObjectId) and a StreamEventDetails table (StreamEventId, DetailKey, DetailValue).

The ActorId, TypeId and ObjectId allow for a Subject-Verb-Object event to be captured (and later queried). Each action may result in several StreamEvent instances being created.

I've then created a subclass of StreamEvent for each type of event, e.g. LoginEvent, PictureCommentEvent. Each of these subclasses has more context-specific properties, such as PictureId, ThumbNail, CommentText, etc. (whatever is required for the event), which are actually stored as key/value pairs in the hashtable/StreamEventDetails table.

When pulling these events back from the database I use a factory method (based on the TypeId) to create the correct StreamEvent class.

Each subclass of StreamEvent has a Render(context As StreamContext) method which outputs the event to the screen based on the passed StreamContext class. The StreamContext class allows options to be set based on the context of the view. If you look at Facebook, for example, your news feed on the homepage lists the full names (and links to the profiles) of everyone involved in each action, whereas when looking at a friend's feed you only see their first name (but the full names of other actors).
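
Here's a rough Ruby transliteration of the factory plus render-context idea (the registry, the context hash and the example event type are my own inventions, not the original code):

class StreamEvent
  REGISTRY = {}  # type_id => subclass

  def self.register(type_id)
    REGISTRY[type_id] = self
  end

  # Factory: pick the right subclass based on the stored TypeId.
  def self.build(row)
    REGISTRY.fetch(row[:type_id]).new(row[:details])
  end

  def initialize(details)
    @details = details
  end
end

class PictureCommentEvent < StreamEvent
  register 2

  # The context decides how much detail to show (e.g. full name vs. "You").
  def render(context)
    actor = context[:viewer_is_actor] ? 'You' : @details[:actor_full_name]
    "#{actor} commented on #{@details[:picture_name]}"
  end
end

row = { type_id: 2, details: { actor_full_name: 'Jane Doe', picture_name: 'Holiday.jpg' } }
puts StreamEvent.build(row).render(viewer_is_actor: false)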

I haven't implemented an aggregate feed (the Facebook home page) yet, but I imagine I'll create an AggregateFeed table with the fields UserId and StreamEventId, populated based on some kind of 'Hmmm, you might find this interesting' algorithm.

Any comments would be massively appreciated.

jammus
I am working on a system like this and am very interested in any knowledge on it. Did you ever finish yours?
jasondavis
A: 

I solved this a few months ago, but I think my implementation is too basic.

I created the following models:

HISTORY_TYPE

  • ID - The id of the history type
  • NAME - The name (the type of the history)
  • DESCRIPTION - A description

HISTORY_MESSAGES

  • ID
  • HISTORY_TYPE - A history message belongs to a history type
  • MESSAGE - The message to print; I put in variables to be replaced by the actual values

HISTORY_ACTIVITY

  • ID
  • MESSAGE_ID - The message ID to use
  • VALUES - The data to use

Example

MESSAGE_ID_1 => "User %{user} created a new entry"

ACTIVITY_ID_1 => MESSAGE_ID = 1, VALUES = {user: "Rodrigo"}
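
In Ruby, merging the stored VALUES into the message template is straightforward with %{} placeholders; the small sketch below just mirrors the example rows above:

# Message templates keyed by MESSAGE_ID, values taken from the activity row.
messages = { 1 => 'User %{user} created a new entry' }
activity = { message_id: 1, values: { user: 'Rodrigo' } }

# String#% substitutes %{user} with the value stored for the activity.
puts messages[activity[:message_id]] % activity[:values]
# => "User Rodrigo created a new entry"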

Rodrigo