Storing News in a Distributed DB vs RDBMS

Hi all: If I am storing news articles in a DB under different categories such as "Tech", "Finance", and "Health", would a distributed database work better for this system than an RDBMS? Each news item would have the article itself attached as well as a few other items. In particular, I am wondering whether querying would be faster.

Let's say I never have more than a million rows, and I want to grab the latest (within 5 hours) tech articles. I imagine that would be a map-reduce of "Give me all tech articles" (possibly 10,000), followed by keeping only the ones whose timestamp falls within the last 5 hours.

Am I thinking about tackling the problem in the right way, and would a DDB even be the best solution? In a few years there might be 5 million items, but even then....

Whether to use a distributed database or key-value store depends more on your operational requirements than your domain problem.

When people ask how to do time-ordered queries in Riak, we usually suggest several strategies (although none of them is a silver bullet, as Riak lacks ordered range queries):

1) If you are frequently accessing a specifically-sized chunk of time, break your data into buckets that reflect that period. For example, all data for the day, hour or minute specified would be either stored or linked to from a bucket that contains the appropriate timestamp. If I wanted all the tech news from today, the bucket name might be "tech-20100616". As your data comes in, add appropriate links from the time-boxed bucket to the actual item.

2) If the data is more sequence-oriented and not related to a specific point in time, use links to create a chain of data, linking backward in time, forward, or both. (This works well for versioned data too, like wiki pages.) You might also have to keep an object that just points at the head of the list.
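Both strategies can be sketched without any particular client library. The following is a minimal illustration only: a plain Python dict stands in for the key-value store, and every name in it (store, put, hourly_bucket, add_article, recent_articles, append_version, walk_back) is hypothetical rather than part of Riak's API. It assumes hourly buckets and a 5-hour window, as in the question.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for a key-value store: bucket name -> {key: value}. Not a real Riak client.
store = {}

def put(bucket, key, value):
    store.setdefault(bucket, {})[key] = value

# Strategy 1: time-boxed buckets that link to the actual items.
def hourly_bucket(category, ts):
    """Name of the bucket for one category and one hour, e.g. 'tech-2010061614'."""
    return f"{category}-{ts.strftime('%Y%m%d%H')}"

def add_article(article_id, category, ts, body):
    # Store the article once, then add a link to it from the time-boxed bucket.
    put("articles", article_id, {"category": category, "ts": ts, "body": body})
    put(hourly_bucket(category, ts), article_id, {"link": ("articles", article_id)})

def recent_articles(category, now, hours=5):
    """Read only the hourly buckets that can overlap the window, then filter exactly."""
    cutoff = now - timedelta(hours=hours)
    found = []
    for offset in range(hours + 1):  # one extra bucket covers the partial first hour
        bucket = hourly_bucket(category, now - timedelta(hours=offset))
        for entry in store.get(bucket, {}).values():
            link_bucket, link_key = entry["link"]
            if cutoff <= store[link_bucket][link_key]["ts"] <= now:
                found.append(link_key)
    return sorted(found)

# Strategy 2: a chain linked backward in time, plus an object pointing at the head.
def append_version(chain, key, body):
    # The key "head" is reserved for the head pointer in this sketch.
    prev = store.get(chain, {}).get("head")
    put(chain, key, {"body": body, "prev": prev})
    put(chain, "head", key)

def walk_back(chain, limit):
    """Follow the backward links from the head, newest first."""
    key = store.get(chain, {}).get("head")
    keys = []
    while key is not None and len(keys) < limit:
        keys.append(key)
        key = store[chain][key]["prev"]
    return keys
```

With this layout, a "latest tech articles" query touches at most six small buckets instead of scanning every tech article, which is the point of the time-boxing. The bucket granularity (hour here, day in the "tech-20100616" example) is a trade-off between the number of buckets read and the number of entries filtered out.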

Those strategies aside, Riak is probably not the 100% solution for up-to-the-minute information, but it might be a better fit for longer-term storage. You could combine it with something like Redis, memcached, or even MongoDB (which has great performance if your data is mildly transient and can fit in memory) to hold a rolling index of the latest stuff.
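As a rough illustration of the "rolling index" idea, here is a small in-memory sketch. It is not tied to Redis, memcached, or MongoDB; the RollingIndex class and its methods are hypothetical names, and it assumes articles are added in roughly increasing timestamp order so that expired entries always sit at the front.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta, timezone

class RollingIndex:
    """In-memory index of recent article ids per category (hypothetical helper)."""

    def __init__(self, window=timedelta(hours=5)):
        self.window = window
        # category -> deque of (timestamp, article_id), oldest first
        self._by_category = defaultdict(deque)

    def add(self, category, ts, article_id):
        # Assumes entries arrive in roughly increasing timestamp order.
        self._by_category[category].append((ts, article_id))

    def latest(self, category, now):
        """Drop entries older than the window, then return the rest, newest first."""
        entries = self._by_category[category]
        cutoff = now - self.window
        while entries and entries[0][0] < cutoff:
            entries.popleft()
        return [article_id for _, article_id in reversed(entries)]
```

The index holds only ids, so the articles themselves can stay in the long-term store and be fetched by key once the index has said which ones are recent.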
