I would have implemented it on GAE, like so:
Each user will have a table containing the tweets of the people they follow. This table will be keyed by (user, timestamp descending).
Each user also has a follower_ranges table, which maps a user to a set of contiguous follower id ranges. For most users, who have only a few thousand followers, this table will have a single entry (-inf..+inf); this will be the implied default. For users with more followers, each range in the table will hold a few thousand users. The ranges will be rebalanced over time to keep the number of users in each within some interval, e.g. larger than 1000 and smaller than 10000. The union of all ranges will cover all user ids.
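To make the shape of that follower_ranges table concrete, here is a minimal in-memory sketch; the class name, the split heuristic, and the 1000/10000 thresholds are illustrative stand-ins, not a real Datastore schema:

```python
# Hypothetical sketch of the follower_ranges structure described above.
from dataclasses import dataclass, field
from typing import List, Tuple

NEG_INF, POS_INF = float("-inf"), float("inf")
MIN_PER_RANGE, MAX_PER_RANGE = 1000, 10000

@dataclass
class FollowerRanges:
    # Each entry is a (low, high) follower-id interval; the union of all
    # entries covers every user id. New users start with the implied
    # single default range (-inf..+inf).
    ranges: List[Tuple[float, float]] = field(
        default_factory=lambda: [(NEG_INF, POS_INF)]
    )

    def split_if_needed(self, idx: int, count: int, mid_id: float) -> None:
        """Rebalance: split a range whose follower count exceeded the cap."""
        if count > MAX_PER_RANGE:
            low, high = self.ranges[idx]
            self.ranges[idx:idx + 1] = [(low, mid_id), (mid_id, high)]
```

A merge step for ranges that shrink below MIN_PER_RANGE would work the same way in reverse; the point is that rebalancing is a local operation on one user's range list.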
Whenever a user -> follower operation is created, it is encoded as an action and added to a queue. Each element in the queue is a (sender, action, payload, follower subrange) tuple. Queue workers take an item, find all the followers in the given subrange, and apply the action to each of them. (Note that the action can be "add a tweet," "delete a tweet," "edit a tweet," etc. Basically anything that will need to be applied to all followers.)
Applying the queue action to each follower will involve issuing the corresponding writes and deletes to each user's tweet table. The queue acts as a barrier, so writes will not appear instantaneously, but it should be possible to keep the delay below a few seconds.
Showing the user their tweets will be a cheap operation: "SELECT * FROM tweets WHERE user_id = :user_id ORDER BY (created_at DESC) LIMIT :max_per_page". This will scan a single table, and be a very fast operation. (Keeping user-blocking latency down is good!)
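Because the table is keyed by (user, timestamp descending), that query is a single prefix scan. A toy equivalent over an in-memory stand-in for the table:

```python
# The read path from the sketch above; the dict is a hypothetical
# in-memory stand-in for the per-user tweets table.
from typing import Dict, List, Tuple

def recent_tweets(tweets: Dict[int, List[Tuple[int, int, str]]],
                  user_id: int, max_per_page: int) -> List[Tuple[int, int, str]]:
    """Equivalent of SELECT ... ORDER BY created_at DESC LIMIT :max_per_page."""
    rows = sorted(tweets.get(user_id, []), key=lambda r: r[0], reverse=True)
    return rows[:max_per_page]
```

In the real table no sort happens at read time: the key order already gives newest-first, so the store only reads the first max_per_page rows.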
I think this design would scale rather well initially. Each component of the system can now be scaled up easily:
- The queue storage can be backed by GAE, and scaled as per any Datastore table
- The frontends can be scaled naturally, and there is no need for stickiness
- More queue processors can be added at any time
- The actual storage tables will grow naturally, and should scale fine on Datastore.
That said, I can think of a couple of future improvements I would look into immediately:
- Reduce storage of rarely-shown data. This design denormalizes each tweet into a per-follower copy. However, only the most recent tweets are usually accessed. By deleting the per-user copies of tweets once they are N days old, we can recover a lot of storage. If a user tries to view something from ancient history, we fetch the data from the original (non-denormalized) tables. This will be slower, but will not happen too often, and the savings will be significant. Storage savings: (#avg_followers - 1) / #avg_followers
- The write pattern is suboptimal. Across multiple queue items, each queue worker will be writing to every user's tweets table, so the locality of writes will not be very good. (Worst case, we'll have #processors * #storage servers connections.) This can be fixed by applying multiple updates to each range of users. For example, if two actions A and B are both destined for range [0, 10000), have a single queue processor apply those two actions in one pass.
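The batching step in that last improvement can be sketched as grouping pending queue items by their follower subrange before dispatch; the item tuple layout is the illustrative one used earlier:

```python
# Sketch of the write-batching improvement: group queue items by follower
# subrange so one worker applies all of a range's actions in a single
# pass, instead of one storage round-trip per item. Names are illustrative.
from collections import defaultdict
from typing import Dict, List, Tuple

Range = Tuple[int, int]
Item = Tuple[int, str, str, Range]  # (sender, action, payload, subrange)

def batch_by_range(items: List[Item]) -> Dict[Range, List[Item]]:
    """Return pending items grouped by their target follower subrange."""
    batches: Dict[Range, List[Item]] = defaultdict(list)
    for item in items:
        batches[item[3]].append(item)
    return dict(batches)
```

A worker then takes one batch, loads the followers in that range once, and applies every action in the batch to each follower, which collapses #items storage passes into one per range.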