Situation:

I am currently designing a feed system for a social website where each user has a feed of their friends' activities. I have two possible methods of generating the feeds, and I would like to ask which is better in terms of ability to scale.

Events from all users are collected in one central database table, event_log. Users are paired as friends in the table friends. The RDBMS we are using is MySQL.

Standard method: When a user requests their feed page, the system generates the feed by inner joining event_log with friends. The result is then cached and set to expire after 5 minutes. Scaling is achieved by varying this expiry.
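
To make this concrete, here is roughly what I have in mind for the standard method; it assumes `event_log` has `user_id` and `created_at` columns, `friends` pairs `user_id` with `friend_id`, and memcached holds the 5-minute cache. The column names are illustrative rather than our real schema.

```python
# Sketch of the "standard" on-demand feed build with a 5-minute cache.
# Column names (event_log.user_id, friends.user_id/friend_id) are assumptions.
import json

import mysql.connector
from pymemcache.client.base import Client

db = mysql.connector.connect(host="localhost", user="app",
                             password="secret", database="social")
cache = Client(("localhost", 11211))

FEED_SQL = """
    SELECT e.id, e.user_id, e.event_type, e.created_at
    FROM event_log e
    INNER JOIN friends f ON f.friend_id = e.user_id
    WHERE f.user_id = %s
    ORDER BY e.created_at DESC
    LIMIT 50
"""

def get_feed(user_id):
    key = "feed:%d" % user_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)          # serve the cached feed

    cur = db.cursor(dictionary=True)
    cur.execute(FEED_SQL, (user_id,))
    feed = cur.fetchall()
    cur.close()

    # Cache for 5 minutes; "scaling" here means tuning this expiry.
    cache.set(key, json.dumps(feed, default=str), expire=300)
    return feed
```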

Hypothesised method: A task runs in the background and for each new, unprocessed item in event_log, it creates entries in the database table user_feed pairing that event with all of the users who are friends with the user who initiated the event. One table row pairs one event with one user.
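
And a minimal sketch of the background task, assuming `event_log` carries a `processed` flag and `user_feed` is a two-column `(event_log_id, user_id)` table; again, the names are only illustrative.

```python
# Sketch of the background fan-out worker for the hypothesised method.
# Assumes event_log has a `processed` flag and user_feed has the two columns
# (event_log_id, user_id); names are illustrative.
import time

import mysql.connector

db = mysql.connector.connect(host="localhost", user="app",
                             password="secret", database="social")

def fan_out_once(batch_size=1000):
    cur = db.cursor()
    cur.execute("SELECT id FROM event_log WHERE processed = 0 LIMIT %s",
                (batch_size,))
    event_ids = [row[0] for row in cur.fetchall()]
    if not event_ids:
        cur.close()
        return 0

    placeholders = ",".join(["%s"] * len(event_ids))
    # One user_feed row per (event, friend-of-the-event's-author) pair.
    cur.execute(
        "INSERT INTO user_feed (event_log_id, user_id) "
        "SELECT e.id, f.user_id FROM event_log e "
        "INNER JOIN friends f ON f.friend_id = e.user_id "
        "WHERE e.id IN (" + placeholders + ")",
        event_ids,
    )
    cur.execute(
        "UPDATE event_log SET processed = 1 WHERE id IN (" + placeholders + ")",
        event_ids,
    )
    db.commit()
    cur.close()
    return len(event_ids)

if __name__ == "__main__":
    while True:                  # simple polling loop; a real worker might
        if fan_out_once() == 0:  # read from a queue instead of polling
            time.sleep(1)
```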

The problems with the standard method are well known: what if a lot of people's caches expire at the same time? The solution also does not scale well, as the brief is for feeds to update as close to real time as possible.

The hypothesised solution seems much better to me: all processing is done offline, so no user waits for a page to generate, and there are no joins, so the database tables can be sharded across physical machines. However, if a user has 100,000 friends and creates 20 events in one session, that results in 2,000,000 rows being inserted into the database.

Question:

The question boils down to two points:

  • Is the worst-case scenario mentioned above problematic, i.e. does table size have an impact on MySQL performance, and are there any issues with mass-inserting data for each event?
  • Is there anything else I have missed?
+1  A: 

I think your hypothesised system generates too much data. Firstly, on a global scale the storage and indexing requirements of user_feed seem to escalate exponentially as your user base becomes larger and more interconnected (both presumably desirable for a social network). Secondly, consider what happens if, in the course of a minute, 1,000 users each enter a new message and each has 100 friends: your background thread then has 100,000 inserts to do and might quickly fall behind.

I wonder if a compromise might be made between your two proposed solutions, where a background thread updates a table last_user_feed_update containing a single row for each user and a timestamp for the last time that user's feed was changed.

Then, although the full join and query would be required to refresh the feed, a quick query to the last_user_feed_update table will tell you whether a refresh is required at all. This seems to mitigate the biggest problems with your standard method as well as avoid the storage size difficulties, but that background thread still has a lot of work to do.
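
Something along these lines is what I am imagining; it assumes a `last_user_feed_update(user_id, last_update)` table maintained by the background thread, and a cached feed that remembers when it was built. The names and the memcached usage are assumptions rather than a finished design.

```python
# Sketch of the compromise: cache the feed without a fixed expiry, remember when
# it was built, and rebuild (with the full join) only if last_user_feed_update
# says one of the user's friends has posted since then. Names are illustrative.
import json
import time

import mysql.connector
from pymemcache.client.base import Client

db = mysql.connector.connect(host="localhost", user="app",
                             password="secret", database="social")
cache = Client(("localhost", 11211))

FEED_SQL = """
    SELECT e.id, e.user_id, e.event_type, e.created_at
    FROM event_log e
    INNER JOIN friends f ON f.friend_id = e.user_id
    WHERE f.user_id = %s
    ORDER BY e.created_at DESC
    LIMIT 50
"""

def feed_is_stale(user_id, built_at):
    cur = db.cursor()
    cur.execute("SELECT UNIX_TIMESTAMP(last_update) "
                "FROM last_user_feed_update WHERE user_id = %s", (user_id,))
    row = cur.fetchone()
    cur.close()
    return row is not None and row[0] > built_at

def get_feed(user_id):
    key = "feed:%d" % user_id
    cached = cache.get(key)
    if cached is not None:
        entry = json.loads(cached)
        # Cheap single-row check instead of a blanket 5-minute expiry.
        if not feed_is_stale(user_id, entry["built_at"]):
            return entry["items"]

    cur = db.cursor(dictionary=True)
    cur.execute(FEED_SQL, (user_id,))
    items = cur.fetchall()
    cur.close()

    cache.set(key, json.dumps({"built_at": time.time(), "items": items},
                              default=str))
    return items
```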

Elemental
But on the other hand, the `user_feed` table only contains two columns, `event_log_id` and `user_id`, and the primary key spans both of them. Each row is therefore 8 bytes, so that's only 800 KB for the scenario you describe. If it becomes a problem, this table could be stored on a totally separate server, or even split across different servers by odd/even user IDs. Sorry, just playing Devil's Advocate, but I'm still not convinced.
SlappyTheFish
Also, falling behind is not an issue: pages will still be served, and if the data is stale during peak times (which occur once a day), it can catch up later. OK, enough talking; I am going to do some tests.
SlappyTheFish
I understand your comments; I too would try some testing and see how it works in practice.
Elemental
First quick test: I just tried 1,000 events with 100 users for each (i.e. 100,000 inserts) on an Amazon EC2 "Small Instance" with a background load of ~0.5, and it took 19 seconds.
SlappyTheFish
So that was my issue: it seems that if you get more load than this coming in every 19 seconds, then your background task will start to fall behind.
Elemental
I agree, and it is possible that during peak times we will, so the background task can simply delay each insert, or even pause altogether, until the peak has passed. Either way, it should have no influence on the serving of pages. What worries me most now is what you said about load increasing exponentially with users, which is a very valid point.
SlappyTheFish
A: 

The hypothesized method works better when you limit the maximum number of friends; a lot of sites set a safe upper bound, including Facebook IIRC. It limits 'hiccups' from when your 100K-friend user generates activity.

Another problem with the hypothesized model is that some of the friends you are essentially pre-generating cache for may sign up and then hardly ever log in. This is a pretty common situation for free sites, and you may want to limit the burden these inactive users place on you.
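
One way to limit that burden might be to restrict the fan-out to friends who have logged in recently, as in the sketch below; the `users.last_login` column is an assumption.

```python
# Sketch: only fan events out to friends who have logged in recently, so dormant
# accounts don't accumulate pre-generated feed rows. users.last_login is assumed;
# a returning user's feed can be rebuilt on demand with the standard join.
FAN_OUT_ACTIVE_ONLY_SQL = """
    INSERT INTO user_feed (event_log_id, user_id)
    SELECT e.id, f.user_id
    FROM event_log e
    INNER JOIN friends f ON f.friend_id = e.user_id
    INNER JOIN users u   ON u.id = f.user_id
    WHERE e.processed = 0
      AND u.last_login > NOW() - INTERVAL 30 DAY
"""
```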

I've thought about this problem many times; it's not a problem MySQL is going to be good at solving. I've thought of ways I could use memcached, where each user pushes their latest few status items to "their key" (and a feed read fetches and aggregates all of your friends' keys), but I haven't tested this. I'm not sure of all the pros/cons yet.
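
Roughly, the untested idea looks like the sketch below; the key scheme, the event fields, and the pymemcache calls are assumptions for illustration only.

```python
# Sketch of the memcached idea: each user pushes their latest few events to
# their own key, and a feed read aggregates the keys of all of their friends.
# Untested; key scheme and event fields (e.g. "created_at") are assumptions.
import json

from pymemcache.client.base import Client

mc = Client(("localhost", 11211))
MAX_ITEMS = 20   # keep only the latest few status items per user

def push_event(user_id, event):
    key = "recent:%d" % user_id
    raw = mc.get(key)
    items = json.loads(raw) if raw else []
    # Read-modify-write; a real version would need gets/cas to avoid races.
    items = ([event] + items)[:MAX_ITEMS]
    mc.set(key, json.dumps(items))

def read_feed(friend_ids):
    # Fetch every friend's recent items in one round trip and merge them.
    keys = ["recent:%d" % fid for fid in friend_ids]
    found = mc.get_many(keys)
    merged = []
    for raw in found.values():
        merged.extend(json.loads(raw))
    merged.sort(key=lambda e: e["created_at"], reverse=True)
    return merged[:MAX_ITEMS]
```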

Morgan Tocker