Hi,

I'm thinking about using Cassandra for a large data project. The data will be sourced from a traditional data warehouse. Cassandra will host the data, formatted so that my application can read it correctly.

I don't quite understand how I will prune the data from Cassandra.

For example, I want to count the number of visits a particular IP address has made to a website in the past 24 hours. I plan on generating this data every hour, and I'd like to keep 2 weeks of it per IP address. My column structure looks like:

127.0.0.1: {
  visitorsLast24Hours: {
    1279554672: 30,
    1279553072: 24,
    etc...
  }
}

How do I remove old entries from the visitorsLast24Hours column?

So far, the best solution I've come up with is to:

  1. Get the column I want to work with
  2. Prune the values I no longer want to keep
  3. Delete the column from the database
  4. Re-insert the new pruned column

This seems like a poor method for working with the database. I'm assuming my data sizes will balloon, based on the way storage is done in Cassandra.

Is there a more efficient way of doing it?

I'm currently working with phpcassa as my interface to Cassandra.

Thanks!

+1  A: 

You actually don't have to delete and re-write the entire column. Assuming you're using a SuperColumn here, you can delete a single sub-column by name from within the supercolumn (visitorsLast24Hours in this case). So you would slice the supercolumn for the timestamp keys that are older than your cutoff time and delete each of those. With a supercolumn you don't have to re-write the entire dataset each time you add or delete a sub-column. Items of interest: the slicing and deleting calls described at http://wiki.apache.org/cassandra/API06.
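To make the idea concrete, here is a rough sketch in Python that uses a plain dict as a stand-in for the row. It is not phpcassa or the Thrift API; the helper names (expired_subcolumns, prune) and the two-week constant are made up for illustration. It only shows the logic: pick the timestamp keys older than the cutoff and delete those, leaving everything else in place.

```python
TWO_WEEKS = 14 * 24 * 60 * 60  # assumed retention window, in seconds


def expired_subcolumns(supercolumn, now, max_age=TWO_WEEKS):
    """Return the sub-column names (Unix timestamps) older than the cutoff."""
    cutoff = now - max_age
    return sorted(ts for ts in supercolumn if ts < cutoff)


def prune(row, supercolumn_name, now, max_age=TWO_WEEKS):
    """Delete only the expired sub-columns; the rest are never rewritten."""
    supercolumn = row[supercolumn_name]
    removed = expired_subcolumns(supercolumn, now, max_age)
    for ts in removed:
        # With a real client this would be one delete per sub-column name.
        del supercolumn[ts]
    return removed


# In-memory stand-in for the row keyed by "127.0.0.1" (third entry is invented).
row = {"visitorsLast24Hours": {1279554672: 30, 1279553072: 24, 1278000000: 12}}
removed = prune(row, "visitorsLast24Hours", now=1279554672)
print(removed)
```

With a real client, the selection step would be a slice over the supercolumn bounded by the cutoff timestamp, and each deletion would be a remove on that sub-column name.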

Unoti