views: 50
answers: 2

I am trying to build a twitter-like follow mechanism. A user takes an action. We make a list of all of that user's followers, and then populate all of their streams with some information. Since this can take some time (if you have 10,000 followers that's 10,000 streams to insert information into, i.e., perhaps 10,000 SQL calls), I want to make sure that this is done in the background, while the user who took the action can go on with their life.

So, the strategy I'm considering is this:

  • User takes an action.
  • The PHP script handling the request launches another PHP script that does all the work; that might take a second or two.
  • Meanwhile, the user who took the action can go on with their life; their request finishes quickly.

Thoughts? I also played around with using a queue, something like SQS, but this approach sounds like it might also work. Plus it has the advantage (for me) of being easier to test locally and easier to run on non-EC2 hosts.

And if this is a good approach, how would I launch a PHP script from within another PHP script? Could it be as simple as doing a GET on the URL where that script lives?
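
For concreteness, here is the kind of thing I have in mind: a minimal fire-and-forget sketch, assuming a CLI-runnable worker.php (the path and argument are illustrative):

    <?php
    // Launch the worker in the background and return immediately.
    // Output is discarded and the trailing & detaches the process,
    // so exec() does not wait for the worker to finish.
    $userId = 123; // the user whose followers we need to fan out to
    $cmd = sprintf(
        'php %s %d > /dev/null 2>&1 &',
        escapeshellarg(__DIR__ . '/worker.php'),
        $userId
    );
    exec($cmd);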

+3  A: 

As described, it sounds like you want to replicate/duplicate the first user's post for everyone who follows that user. That's going to be a data storage nightmare.

You should look at it from the other point of view. Consider the following model:

User A posts what he ate for breakfast. This is stored once in a table with his user id.

User B looks at his "stream". This is a dynamically created list of posts. At this point User B is following 50 people. User B's script will get the 50 most recent posts from whomever he is following and display them for him on his "stream".

With this model, you never store more than one copy of each frivolous breakfast update. Also, the number of followers does not scale up the processing time needed to publish the twit. I mean tweet.

Clarification

You'll just need to do some normalization. So you'll have a users table, a users_following table, and a posts table. The query would look akin to:

SELECT posts.* FROM users_following
         JOIN posts ON posts.user_id = users_following.followed
         WHERE users_following.follower = $idOfUserB
         ORDER BY posts.created DESC LIMIT 50;
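
A possible set of tables behind that query, with the indexes it needs (a sketch; exact types, and any names beyond those used in the query, are illustrative):

    CREATE TABLE users (
        id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(64) NOT NULL
    );

    CREATE TABLE posts (
        id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        user_id INT UNSIGNED NOT NULL,
        body TEXT NOT NULL,
        created DATETIME NOT NULL,
        INDEX idx_user_created (user_id, created)  -- serves the join and the sort
    );

    CREATE TABLE users_following (
        follower INT UNSIGNED NOT NULL,
        followed INT UNSIGNED NOT NULL,
        PRIMARY KEY (follower, followed)           -- serves WHERE follower = ?
    );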
Stephen
User B logs in. User B is following 200 people. Now we have to run a query that says "SELECT * FROM content WHERE authorid IN ({all the ids of those 200 people}) ORDER BY date_created DESC". Or 1,000 people. That doesn't scale at all; that query is going to get really slow. The data storage you mention isn't actually a problem: I won't duplicate the data for each follower, I'll just duplicate a pointer to the data (an id) for each follower, and that's a trivial amount of storage. Then we just do two fast queries to show your stream: get the list of content ids, and get the content (see the sketch below this thread). Thoughts?
I guess I'm trying to avoid that query "SELECT * FROM content WHERE authorid IN (a really long list)", because I suspect it will be pretty slow, especially if you are following a lot of people.
Right. I'm not doing an IN. IN is very slow, as you said. Check out my query.
Stephen
Ooops! I made a mistake in my query. I just updated it to make more sense.
Stephen
Awesome. Back to SQL :) This will be fast even if you're following 500 people?
Absolutely. The LIMIT 50 ensures it. Just make sure you put indexes on your id columns and foreign key columns.
Stephen
If you don't use memcached with MySQL you are going to be in serious pain.
Alfred
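
For reference, a sketch of the pointer-based stream the asker describes in the comments above, assuming a hypothetical streams table of (follower_id, post_id, created) rows written at post time:

    -- 1) pointers only: the follower's stream is a short, indexed list of ids
    SELECT post_id FROM streams
             WHERE follower_id = $idOfUserB
             ORDER BY created DESC LIMIT 50;

    -- 2) fetch those 50 posts by primary key; this IN is cheap because the
    --    list is short and hits the primary key, unlike an IN over every
    --    followed author's id
    SELECT * FROM posts WHERE id IN (/* the 50 ids from step 1 */);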
A: 

If you want your site to scale at all:

  • First, you need to use a message queue, for example redis/beanstalkd/gearmand.
  • Second, you need to do your operations in memory, for example with redis/memcached. Make sure you have enough memory to keep the active dataset in memory.

(if you have 10,000 followers that's 10,000 streams to insert information into, i.e., perhaps 10,000 SQL calls)

10,000 SQL calls has fail whale written all over it. I would not use MySQL (or at least I'd pair it with memcached) for such an application; I'd use redis. Keep the active dataset in memory and keep the data model as simple as possible.
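
A sketch of such a data model, assuming the phpredis extension (key names are illustrative): each post is stored once under its own key, and each follower's stream is just a list of post ids kept in memory:

    <?php
    // write path: store the post once, then fan out only its id
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    $postId = $redis->incr('posts:next_id');
    $redis->set("post:$postId", json_encode(['user' => 42, 'body' => '...']));
    foreach ($redis->sMembers('followers:42') as $followerId) {
        $redis->lPush("stream:$followerId", $postId);
        $redis->lTrim("stream:$followerId", 0, 999); // cap each stream
    }

    // read path: 50 most recent ids, then one round trip for the posts
    $ids   = $redis->lRange('stream:7', 0, 49);
    $keys  = array_map(function ($id) { return "post:$id"; }, $ids);
    $posts = $redis->mGet($keys);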

And if this is a good approach, how would I launch a PHP script from within another PHP script?

Don't do that. Push messages onto a redis list via lpush and read them via blpop from a daemon process. I would populate the streams of online users first and offline users next; offline users don't mind a delay because they aren't online. You would push a reference to the post's key onto the stream list of every user following that person, and then fetch all the keys at once via mget.
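
A minimal sketch of that queue, again assuming phpredis (queue and key names are illustrative). The web request only enqueues a job; a long-running daemon does the fan-out:

    <?php
    // producer.php (runs in the web request): enqueue and return immediately
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);
    $redis->lPush('queue:fanout', json_encode(['user' => 42, 'post' => 1001]));

    <?php
    // daemon.php (long-running consumer): block until a job arrives, then
    // push the post id onto each follower's stream list
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);
    while (true) {
        list(, $payload) = $redis->blPop(['queue:fanout'], 0); // 0 = block forever
        $job = json_decode($payload, true);
        foreach ($redis->sMembers("followers:{$job['user']}") as $followerId) {
            $redis->lPush("stream:$followerId", $job['post']);
        }
    }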

Could it be as simple as doing a GET on the URL where that script lives?

Again, don't call URLs; use a message queue. Calling a URL will give you undesired overhead.

Awesome. Back to SQL :) This will be fast even if you're following 500 people?

SQL will give you fail whales big time under high load. At the very least you will need memcached, but I would use redis instead.

Alfred