tags:

views:

50

answers:

3

Let's say I have 500 RSS feeds that need updating very constantly but do not want to check all 500 every minute. What approach or algorithm can best determine which feeds should be updated while others are left for a later time?

Assume I can and will save historical data/stats, and that update frequency varies even within the same feed.

+1  A: 

Well, you partially answered your question yourself.

Use existing statistics to prioritize feeds basing on their update rate. Keep statistics itself updated, so it will adjust itself to the changes in update frequency.

alex
A: 

You're describing the very common problem of priority scheduling. There are a lot of possible approaches, but here's a simplified version.

  • Devise a priority function that determines how important it is that a particular feed be updated. (For example, if it's only been a little while since a particular feed has been updated and it has a low historical update rate, it would probably have a low priority.)
  • Then place the feeds in a priority queue.
  • When you need more work to do, take the highest-priority feed from the queue.
  • To ensure that everybody gets a chance to update, periodically elevate the priority of feeds that haven't been updated in a while.
John Feminella
A: 

If you want I high-end approach, you can work like this. Select a probabilistic model for the RSS feeds, for example that the time between updates follows a continuous probability distribution, for example the exponential distribution. For every RSS feed, use the maximum likelihood method for estimating the parameters of the individual distributions based on the history of updates on that feed. Now you have a probabilistic model which you can use the calculate the probability that any particular RSS feed has an update available at any specific time. Whenever you have time slot available to check for updates on one stream, check the one that has new data available with highest probability. The exponential probability, for example, is memoryless, which means that if you check the feed for update and there is none, the probability that it has an update will "reset" to 0% at the time of the check and will then grow upwards from there, prioritizing the other feeds in the immediate future over this one.

antti.huima