I am working on a project that monitors a large number of RSS/Atom feeds. I want to use HBase for data storage, and I am having trouble designing the schema. For the first iteration I want to be able to generate an aggregated feed (the last 100 posts from all feeds, in reverse chronological order).
Currently I am using two tables:
Feeds: column families Content and Meta : raw feed stored in Content:raw
Urls: column families Content and Meta : raw post version stored in Content:raw and the rest of the data found in the RSS stored in Meta
I need some sort of index table for the aggregated feed. How should I build it? Is HBase a good choice for this kind of application?
Question update: Is it possible (in HBase) to design a schema that can efficiently answer queries like the one listed below?
SELECT data FROM Urls ORDER BY date DESC LIMIT 100
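To make the expected result concrete, here is a minimal in-memory sketch in plain Python of what that query should return. It is not HBase code, only a statement of the semantics I am after. The function name aggregated_feed, the sample rows, and the column name Meta:date are assumptions made for illustration; the real schema only has the Content and Meta families described above.

```python
import heapq

def aggregated_feed(urls, limit=100):
    """Return the raw content of the `limit` newest posts, newest first.

    `urls` maps a row key (the post URL) to a dict of column -> value,
    mimicking the Urls table. "Meta:date" is an assumed column name.
    """
    # nlargest returns the rows sorted from newest to oldest.
    newest = heapq.nlargest(limit, urls.values(), key=lambda row: row["Meta:date"])
    return [row["Content:raw"] for row in newest]

# Hypothetical sample rows; dates are Unix timestamps.
urls = {
    "http://example.com/a": {"Content:raw": "post a", "Meta:date": 1000},
    "http://example.com/b": {"Content:raw": "post b", "Meta:date": 3000},
    "http://example.com/c": {"Content:raw": "post c", "Meta:date": 2000},
}

print(aggregated_feed(urls, limit=2))  # ['post b', 'post c']
```

This scans every row, which is exactly what I want to avoid at scale; the question is how to get the same ordering out of HBase without a full scan.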