views:

31

answers:

1

In my web application, my users have many events. One such event is "user updated facebook status." A user could have hundreds of that type of event, and there are 10 types of events. I need to display event counts and other user statistics based on events in a very scalable manner. This is because each user will be able to see his or her statistics. We obviously cannot afford to run each calculation every time a user hits the website so caching these statistics is certainly the way to go.

Does it make sense to create a separate table for "statistics" which would have a user id and a column for each statistic we are caching? The table is essentially acting so I was wondering if there was a better or different way to do it.

+1  A: 

You could just calculate these stats and put them in memcache, reading/incrementing them as required, as this data doesn't need to persist (there would be a server load spike from with a cold cache that you might consider rate limiting logins/calculations etc). However this scenario is an ideal candidate for a non-relational data stores such as Cassandra (a "highly scalable, eventually consistent, distributed, structured key-value store"). This internal Digg article is a very interesting read:

The fundamental problem is endemic to the relational database mindset, which places the burden of computation on reads rather than writes. This is completely wrong for large-scale web applications, where response time is critical. It’s made much worse by the serial nature of most applications. Each component of the page blocks on reads from the data store, as well as the completion of the operations that come before it.

Non-relational data stores reverse this model completely, because they don’t have the complex read operations of SQL. The model forces you to shift your computation to the writes, while reducing most reads to simple operations – the equivalent of SELECT * FROM Table.

Andy
Cassandra is very interesting, thanks for the link. I am researching
Tony
what is the this mean - "Each component of the page blocks on reads from the data store, as well as the completion of the operations that come before it."
Tony
The code is processed sequentially (i.e. line by line) so slow operations slow everything below them. "blocks on reads" means waits for the data to be returned.
Andy
Gotcha. It seems like Cassandra might be overkill for my situation because the main point seems to be preventing complex joins and read time. In my case, I really just need to cache a count of each type of event. For example, I have event type "user updated facebook status." There may be 50 million rows in the event table and I need to query a COUNT by userid and event_type_id so I can say "this user registered this event 43 times" So the problem is not so much a complicated query, but an expensive operation. Do you think Cassandra is still worth consideration?
Tony