I'm moving an existing MySQL-based application over to Cassandra. So far, finding the equivalent Cassandra data model has been quite easy, but I've stumbled on the following problem, for which I'd appreciate some input:

Consider a MySQL table holding millions of entities:

CREATE TABLE entities (
  id INT AUTO_INCREMENT NOT NULL,
  entity_information VARCHAR(...),
  entity_last_updated DATETIME,
  PRIMARY KEY (id),
  KEY (entity_last_updated)
);

Every five minutes the table is queried for entities that need to be updated:

 SELECT id FROM entities 
  WHERE entity_last_updated IS NULL 
     OR entity_last_updated < DATE_ADD(NOW(), INTERVAL -7*24 HOUR)
  ORDER BY entity_last_updated ASC;

The entities returned by this query are then updated using the following query:

 UPDATE entities 
    SET entity_information = ?, 
        entity_last_updated = NOW()
  WHERE id = ?;

What would be the corresponding Cassandra data model that would allow me to store the given information and efficiently query the entities table for entities that need to be updated (that is, entities that have not been updated in the last seven days)?

+1  A: 

You'd have to scan all the rows and grab the timestamp from the column(s) you're interested in. If this is something you run every day or so, doing this in a Hadoop job should be fine. If it's something you run every few minutes, then you'll need to come up with another approach.

jbellis
Hi! The query is being issued once every five minutes. I've now updated my question with that info.
knorv
+2  A: 

To achieve what you described, use the timestamp as the column name and call the get slice function with a start time and an end time; it will give you all entries whose column name falls within that range. Also sort by column name, so that the results come back ordered by time.
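
The idea above can be sketched without a cluster. The following is a minimal in-memory model in Python, not Cassandra client code: a sorted list stands in for one wide row whose column names are (timestamp, entity id) pairs, and a range lookup stands in for the slice call. The names `UpdateIndex`, `touch`, `add`, `stale` and `NEVER` are hypothetical and exist only for illustration. It also assumes the old index column is removed whenever an entity is updated; otherwise the entity would still show up under its previous timestamp.

```python
import bisect
from datetime import datetime, timedelta

# Sentinel for "never updated" (the IS NULL case in the MySQL query).
# Assumption: sorting these first is the desired behaviour.
NEVER = datetime.min


class UpdateIndex:
    """In-memory stand-in for one wide row with time-sorted column names."""

    def __init__(self):
        self._columns = []       # sorted list of (last_updated, entity_id)
        self._last_updated = {}  # entity_id -> last_updated

    def add(self, entity_id):
        """Register an entity that has never been updated."""
        self.touch(entity_id, NEVER)

    def touch(self, entity_id, now):
        """Record an update: drop the old index column, insert the new one."""
        old = self._last_updated.get(entity_id)
        if old is not None:
            i = bisect.bisect_left(self._columns, (old, entity_id))
            del self._columns[i]
        bisect.insort(self._columns, (now, entity_id))
        self._last_updated[entity_id] = now

    def stale(self, cutoff):
        """Slice from the start of the row up to cutoff, oldest first."""
        end = bisect.bisect_left(self._columns, (cutoff,))
        return [entity_id for _, entity_id in self._columns[:end]]


if __name__ == "__main__":
    index = UpdateIndex()
    start = datetime(2010, 1, 1)
    index.touch(1, start)
    index.touch(2, start + timedelta(days=3))
    index.add(3)
    now = start + timedelta(days=8)
    # Entities not updated in the last seven days, oldest first.
    print(index.stale(now - timedelta(days=7)))
```

In a real deployment the two steps inside `touch` would be two separate writes, so a reader may briefly see an entity under both its old and new timestamp; the periodic job should tolerate that.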

mamu