I'd like to use Cassandra to store a counter, for example how many times a given page has been viewed. The counter will never decrement. The value does not need to be exact at every moment, but it should be accurate over time.

My first thought was to store the value as a column: read the current count, increment it by one, and write it back. However, if another operation is incrementing the counter at the same time, I think the final value would just be the one with the latest timestamp, so one of the increments would be lost.
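To make the concern concrete, here is a minimal sketch of that race. It uses a small in-memory stand-in for a store that resolves conflicts by latest timestamp; it is not a Cassandra client, and `LastWriteWinsStore`, `simulate_lost_update`, and the key name are hypothetical names chosen for illustration.

```python
# In-memory stand-in for a store that keeps the value with the latest
# timestamp (last write wins). Hypothetical; not a real Cassandra API.
class LastWriteWinsStore:
    def __init__(self):
        self._cells = {}  # key -> (timestamp, value)

    def read(self, key):
        # Missing keys read as 0.
        return self._cells.get(key, (0, 0))[1]

    def write(self, key, value, timestamp):
        current_ts, _ = self._cells.get(key, (-1, None))
        if timestamp >= current_ts:
            self._cells[key] = (timestamp, value)


def simulate_lost_update():
    store = LastWriteWinsStore()
    store.write("page:/home", 5, timestamp=1)

    # Two clients read the same value before either one writes back.
    seen_by_a = store.read("page:/home")
    seen_by_b = store.read("page:/home")

    # Both write "what I saw + 1"; the later timestamp simply wins.
    store.write("page:/home", seen_by_a + 1, timestamp=2)
    store.write("page:/home", seen_by_b + 1, timestamp=3)

    return store.read("page:/home")


final_count = simulate_lost_update()
```

Two page views happened, but `final_count` ends up one higher than the starting value rather than two, which is the lost update I'm worried about.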

Another thought would be to store each page load as a new column in a CF. Then I could just run get_count() on that key to get the number of columns. However, from reading the documentation, it appears that get_count() is not an efficient operation.

Am I approaching the problem incorrectly?

+1  A: 

I definitely wouldn't use get_count, as that is an O(n) operation that is run every time you read the "counter." Worse than just being O(n), it may span multiple nodes, which would introduce network latency. And finally, why tie up all that disk space when all you care about is a single number?

For right now, I wouldn't use Cassandra for counters at all. They are working on this functionality, but it's not ready for prime time yet.

https://issues.apache.org/jira/browse/CASSANDRA-1072

You've got a few options in the meantime.

1) (Bad) Store your count in a single record and have one and only one thread of your application be responsible for counter management.

2) (Better) Split the counter into n shards, and have n threads, each managing one shard as a separate counter. The total is the sum of the shards. You can pick the thread at random on each request for stateless load balancing across these threads. Just make sure that each thread is responsible for exactly one shard.

3) (Best) Use a separate tool that is either transactional (e.g. an RDBMS) or that supports atomic increment operations (e.g. memcached or Redis).
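As a rough sketch of option 2, the shard bookkeeping could look something like the following. It keeps the shards in a plain in-memory list rather than in Cassandra, and `ShardedCounter`, `pick_shard`, and `total` are hypothetical names; the point is only that each shard has a single writer and the count is read by summing the shards.

```python
import random


class ShardedCounter:
    """Counter split into independent shards (illustrative, in-memory only)."""

    def __init__(self, num_shards):
        self._shards = [0] * num_shards

    def pick_shard(self, rng=random):
        # Stateless load balancing: choose a shard at random per request.
        return rng.randrange(len(self._shards))

    def increment(self, shard_id, amount=1):
        # Assumption: only the one thread that owns shard_id calls this,
        # so the read-increment-write on a shard never races.
        self._shards[shard_id] += amount

    def total(self):
        # The logical counter value is the sum over all shards.
        return sum(self._shards)


counter = ShardedCounter(num_shards=4)
for _ in range(100):
    counter.increment(counter.pick_shard())
page_views = counter.total()
```

Reads cost n shard lookups instead of one, so n is a trade-off between write throughput and read cost.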

Ben Burns
A: 

What I ended up doing was using get_count() and caching the result in a caching ColumnFamily.

This way I could get a general guess at the count but still get the exact count whenever I wanted.

Additionally, I was able to adjust how stale a result I was willing to accept on a per-request basis.
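The idea is roughly the following sketch. It is not my actual code: it caches in process memory instead of in a ColumnFamily, `count_fn` stands in for the expensive get_count() call, and `CachedCount` and `max_age_seconds` are hypothetical names.

```python
import time


class CachedCount:
    """Cache an expensive count; each caller states how stale it may be."""

    def __init__(self, count_fn, clock=time.monotonic):
        self._count_fn = count_fn    # stand-in for the expensive count call
        self._clock = clock          # injectable so the sketch is testable
        self._value = None
        self._fetched_at = None

    def get(self, max_age_seconds):
        now = self._clock()
        fresh = (
            self._fetched_at is not None
            and now - self._fetched_at <= max_age_seconds
        )
        if not fresh:
            # Cached value is missing or too old for this request: recount.
            self._value = self._count_fn()
            self._fetched_at = now
        return self._value
```

A page render might call `get(max_age_seconds=300)` and accept a rough number, while a report could call `get(max_age_seconds=0)`, which recounts unless the cached value was fetched at that same instant.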

Stephen Holiday