views:

163

answers:

5

I'm going to work on an inspiration site for a NGO, and I'm looking to implement some sort of Facebook-esque event stream, with events like “Michael recommended apple pie”, “John commented on chocolate cake”, “Caramel fudge was posted 8 hours ago by Alice”, etc.

The thing is that these events are interest-based, so someone only be interested in caramel and cherries and should not see apple pies or chocolate cakes then. There are a lot of permutations for this, and generating a user’s personalized event stream on the fly would mean some rather expensive database-queries.

So my thinking was to pre-generate a relation between the receiving user and the posted event (probably a simple SQL JOIN-table) by doing some sort of background processing whenever an action event is happening.

The work required to weigh the preferences of hundreds of users against an event is bound to be substantial, so it cannot be done as a part of POST request that triggers the work, so I’ll have to do a lot of the work in a different process. I’m currently looking at Gearman for this task, but I’m very open to suggestions.

I’m not looking for someone to do my work for me, but if anyone has any prior experience with building this sort of thing, I'd love to hear your thoughts.

+2  A: 

I have had some experience of building a news stream on a social networking site and yes, queries can get very complex very quickly when you have multiple types of events and multiple levels of interest (or privacy settings, or user permissions).

On the assumption that events are viewed more often than they are generated, it does make sense to do some denormalisation and calculate an event's potential viewers when the event happens, rather than every time somebody requests the news stream.

I would suggest running a background process which converts these event objects (related to their creators) into simpler message objects (related to their reader, the people who see them on the news stream). You may end up with many messages per event, but this will make requests to to the front-end much quicker, and offload the work onto the background processes.

I've not used Gearman, but if it is the sort of thing which allows you to load up your app's environment in a background process and receive the events to process through a queue, then it's probably a good idea.

My simple solution was to roll my own using beanstalkd and my own PHP scripts.

Ben James
Okay, thanks for the suggestion, I'll take a look at Beanstalk :)
mikl
+1  A: 

Don't know how your DB is structured (you might want to tell us more), but something obvious like

SELECT events.* FROM events, event_tags, user_tags
     WHERE event_tags.event_id = events.id 
         AND event_tags.tag_id = user_tags.tag_id
         AND  user_tags.user_id = <$user_id>

doesn't seem extremely heavy to me, assuming you have indices all over the place

stereofrog
Yeah, it could be nice to be able to query the events directly, but given the amount of things that come into consideration (tags, keywords, author, recommendations, comments and usage patterns) I don't think it would be possible to build an efficient index while keeping flexibility for future changes. I'd love to be proven wrong, though.
mikl
+1  A: 

This sounds like something that can be solved with a proper index. I would build the solution around the presumption that the database is capable of handling it, but place a service in front of the database and let all clients go through this point. If things begin going too slow, you can introduce various types of caching in this layer. As with most performance decisions, trying to do it right up front is probably not a good idea.

troelskn
Given the complexity and amount of data to consider (see my reply to Stereofrog), I can't really see how such an index could be made, without having to do 15-30 JOINs… and going by the assumption that MySQL is able to handle a lot of heavy queries, would in my opinion be begging for trouble :)
mikl
+1  A: 

Facebook developed their own database to do this sort of thing and open sourced it, I don't know much about it but I'm guessing it might be worth a look.

Rob West
To my knowledge, Cassandra is mainly a storage mechanism for stream data, and not all that concerned with the generation of it.
mikl
+1  A: 

Have you looked at the activity module? http://drupal.org/project/activity

I'll be curious about what you come up with because need to do something like this in semi-near future.

Kyle Mathews
No, but I'll check it out, thank you :)
mikl