I'm trying to learn database design by creating a Twitter clone, and I was wondering what's the most efficient way of implementing the friends' timeline feature. I am implementing this in Google App Engine, which uses Bigtable to store the data. IIRC, this means very fast reads (gets), but considerably slower queries, and considerably slower writes. Currently I can think of two methods, each with its own setbacks:

For each user, keep a list structure that is their friends' timeline. Every time someone makes a tweet, this structure gets updated for each of their followers. This method uses a lot of write operations, but retrieving the timeline will feel very fast for each user.
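A minimal in-memory sketch of this write-time fan-out approach (the dict-based follower graph and timeline store here are illustrative stand-ins for Datastore entities, not real GAE APIs):

```python
followers = {}  # author -> set of follower ids
timelines = {}  # user id -> list of (timestamp, author, text), newest first

def follow(follower, author):
    followers.setdefault(author, set()).add(follower)

def post_tweet(author, timestamp, text):
    # One write per follower: posting is expensive, reading is cheap.
    for user in followers.get(author, ()):
        timelines.setdefault(user, []).insert(0, (timestamp, author, text))

def get_timeline(user):
    # Reading is a single cheap lookup of a precomputed list.
    return timelines.get(user, [])

follow("alice", "bob")
follow("carol", "bob")
post_tweet("bob", 1, "hello")
print(get_timeline("alice"))  # [(1, 'bob', 'hello')]
```

Note that a post by a user with N followers costs N writes, which is exactly the write amplification the question describes.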

or

For each user, calculate the friends' timeline dynamically by fetching all the tweets of the people they're following and merging them into one friends' timeline (since each individual person's tweets are already sorted chronologically). This might be slow if the person is following a lot of people.
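Since each author's tweets are already sorted, this read-time merge is a k-way merge of sorted streams. A sketch using Python's `heapq.merge` (the sample data and `following` map are made up for illustration):

```python
import heapq
import itertools

# Each author's tweets are kept newest-first: (timestamp, author, text).
tweets_by_user = {
    "bob":  [(5, "bob", "later"), (1, "bob", "hello")],
    "dave": [(3, "dave", "hi")],
}
following = {"alice": ["bob", "dave"]}

def get_timeline(user, limit=20):
    # heapq.merge does a lazy k-way merge of the already-sorted streams;
    # reverse=True because the inputs are sorted newest-first.
    streams = (tweets_by_user.get(a, []) for a in following.get(user, []))
    merged = heapq.merge(*streams, key=lambda t: t[0], reverse=True)
    return list(itertools.islice(merged, limit))

print(get_timeline("alice"))
# [(5, 'bob', 'later'), (3, 'dave', 'hi'), (1, 'bob', 'hello')]
```

The merge itself is cheap, but on App Engine each followed user still costs a separate query, so the latency grows with the number of people followed.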

Are there other ways that I'm not aware of? Both of these methods seem like they will make the system choke as the number of users increases.

+1  A: 

You need to focus on the object of the exercise, which you say is learning about database design. So don't get hung up on scalability. Design a database which works for you and your mates to use. Pretty much any design you pick will be able to handle that sort of load. Apart from anything else, the GAE license would start to charge you big bucks if you even started to approach Twitter-style levels of hits.

The thing is, scalability for players like Twitter and Facebook is a major part of their proposition. Consequently they expend a lot of effort in building their apps to scale. They do this with lots of optimizations, including different storage architectures for different types of data, distributed servers, and caching, lots of caching. In other words, it's done with infrastructure and architecture, not database design.

High Scalability is a very good source of relevant information. For instance, this summary of a presentation by Twitter's Evan Weaver last year is extremely pertinent:

"[E]verything in RAM now, database is a backup; peaks at 300 tweets/second; every tweet followed by average 126 people; vector cache of tweet IDs; row cache; fragment cache; page cache; keep separate caches; GC makes Ruby optimization resistant so went with Scala; Thrift and HTTP are used internally; 100s internal requests for every external request; rewrote MQ but kept interface the same; 3 queues are used to load balance requests; extensive A/B testing for backwards capability; switched to C memcached client for speed; optimize critical path; faster to get the cached results from the network memory than recompute them locally."

Hmmm, "Database is a back-up" only. Scary stuff (for a database guy like me).

APC