Assume the database has tables for Users, Feeds, and Items, and some way of knowing which Items the user has already seen. I am looking for a design paradigm that can be used on the server to quickly compute [feed_id, num_unread] for each feed that the user has subscribed to.

Assume plenty of users, and that the feeds are updated periodically in the backend.

Edit: I wanted to address the problem Nick J has brought up (see below). I appreciate the solution posted by cletus, but I am not so worried about the db queries themselves; what I want is a "design paradigm" -- like keeping a watchdog process that maintains the unread counts in memory so that they can be served at any point.
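To make that concrete, I imagine something like this summary table (names are hypothetical), kept current by the watchdog, so that serving the counts is a single indexed lookup:

CREATE TABLE user_feed_unread (
    user_id    INT NOT NULL,
    feed_id    INT NOT NULL,
    num_unread INT NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, feed_id)
);

-- serving a request is then just:
SELECT feed_id, num_unread
FROM user_feed_unread
WHERE user_id = :user_id;

The open question is what keeps this correct as feeds update and items are read.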

+1  A: 

I'm not sure what to tell you exactly because what you're asking is reasonably straightforward.

First off, use Google Reader as a reference for online feed aggregators/readers. And if you're trying to recreate the functionality, Google Reader has pretty much nailed it already (imho).

Google Reader works simply by storing a list of feeds. In DB terms, you'd probably have these entities:

User: id, name, email, etc...
Feed: id, feed_name, feed_url
Content: id, feed_id, title, content
User Feed: id, user_id, feed_id, user_label (the user's subscriptions)
User Content: id, user_id, content_id, has_read (per-user, per-item read state; a flag on User Feed alone couldn't count unread items)
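
When the backend fetcher stores a new Content row, it fans it out to that feed's subscribers so there is a per-user flag to count. A sketch, assuming the schema above:

INSERT INTO user_content (user_id, content_id, has_read)
SELECT uf.user_id, :content_id, 0
FROM user_feed uf
WHERE uf.feed_id = :feed_id;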

Unread items for a user:

SELECT COUNT(1)
FROM user_content uc
WHERE uc.user_id = :user_id
AND uc.has_read = 0

Unread items by feed (the [feed_id, num_unread] pairs you asked for):

SELECT c.feed_id, f.feed_name, COUNT(1) AS num_unread
FROM user_content uc
JOIN content c ON c.id = uc.content_id
JOIN feed f ON f.id = c.feed_id
WHERE uc.user_id = :user_id
AND uc.has_read = 0
GROUP BY c.feed_id, f.feed_name

And then you just need some mechanism for marking items as read. In Google Reader's case, these are AJAX calls triggered by mouseover events, with additional links to mark everything as read, leave an item marked as unread, and so on.
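
On the server, each of those calls is just an UPDATE. A sketch against the schema above:

UPDATE user_content
SET has_read = 1
WHERE user_id = :user_id
AND content_id = :content_id;

-- "mark all as read" for one feed
UPDATE user_content
SET has_read = 1
WHERE user_id = :user_id
AND content_id IN (SELECT id FROM content WHERE feed_id = :feed_id);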

cletus
The problem you're missing is that counting unread items at runtime is extremely inefficient - O(n) in the number of unread items - and thus impractical at large scale. The approach Reader takes is to put a cap on the number of unread items it will count.
Nick Johnson
I'm not missing it at all. It's just not relevant. The number of items you can count at runtime with the above scheme is huge in real terms, and a modern database will typically use the index rather than reading the rows to get counts of unread items. Don't micro-optimize. The above is a solid design.
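For example, an index like this (name hypothetical) lets those counts be served from the index rather than the rows:

CREATE INDEX idx_user_content_unread ON user_content (user_id, has_read);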
cletus
I'll just add that it's really important you get a solid design first. Then, and only then, do you optimize it - if and only if you need to. In this case you could denormalize the model by storing unread counts on User Feed, but don't start out that way. Doing that means you have to update values in two tables when an item is read, and the values can get out of sync.
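For illustration, the denormalized write path would look something like this (unread_count being a hypothetical new column on User Feed; the application must only run the decrement when the first UPDATE actually changed a row):

BEGIN;
UPDATE user_content
SET has_read = 1
WHERE user_id = :user_id AND content_id = :content_id AND has_read = 0;
UPDATE user_feed
SET unread_count = unread_count - 1
WHERE user_id = :user_id
AND feed_id = (SELECT feed_id FROM content WHERE id = :content_id);
COMMIT;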
cletus
It's not micro-optimization. Even with an index, counting up all the items at runtime when there are potentially thousands (spread across hundreds of feeds) adds significant overhead to rendering a request. Precalculating counts when possible is much better, and failing that, specifying a reasonable cap like Reader's '1000+' is a sensible alternative.
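A sketch of that '1000+' approach - the capped count can be written so the database never scans past the cap; if the result is 1001, display "1000+":

SELECT COUNT(1) FROM (
    SELECT 1
    FROM user_content
    WHERE user_id = :user_id AND has_read = 0
    LIMIT 1001
) capped;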
Nick Johnson
You're both second-guessing the database before you've even identified a problem, which is ALWAYS a mistake. If you actually find a problem, then and only then should you start caching values. You can alter the above model to do that IF REQUIRED. Secondly, by talking about thousands of unread items, you're not being realistic about how users will use the system.
cletus