I have an application that polls several RSS sources on the web.

What is the etiquette when polling other people's web servers? How frequently should I poll, etc.?

What are the best practices?

+1  A: 

Once an hour is a frequency I've heard.

altCognito
+4  A: 

Google's Feedfetcher claims it polls most RSS feeds slightly less often than once per hour.

From: http://code.google.com/apis/ajaxfeeds/documentation/

Feed Crawl Frequency

As the Google AJAX Feed API uses Feedfetcher, feed data from the AJAX Feed API may not always be up to date. The Google feed crawler ("Feedfetcher") retrieves feeds from most sites less than once every hour. Some frequently updated sites may be refreshed more often.

Jonathan Fingland
+1 for the reference
altCognito
+2  A: 

Once an hour, if you want to just go by rule-of-thumb (but the link explains some better options).

Bill the Lizard
+1 for the reference
altCognito
A: 

RSS has a ttl setting (the number of minutes a feed may be cached), so really you should only poll when the TTL expires.

But I guess if they don't put one in, it's their problem, and you should poll something like once an hour.
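As a rough sketch of the idea above, the following reads the `<ttl>` element of an RSS 2.0 channel and falls back to 60 minutes when it is missing or unusable. The function name `poll_interval_minutes` and the 60-minute default are illustrative assumptions, not part of any library.

```python
import xml.etree.ElementTree as ET

# Assumed fallback: the once-an-hour rule of thumb from this thread.
DEFAULT_INTERVAL_MINUTES = 60


def poll_interval_minutes(rss_xml: str, default: int = DEFAULT_INTERVAL_MINUTES) -> int:
    """Return the channel's <ttl> in minutes, or `default` if absent or invalid."""
    root = ET.fromstring(rss_xml)          # root element is <rss>
    ttl_text = root.findtext("channel/ttl")
    if ttl_text is None:
        return default
    try:
        ttl = int(ttl_text.strip())
    except ValueError:
        return default
    # A zero or negative TTL is treated as "not specified".
    return ttl if ttl > 0 else default


feed = '<rss version="2.0"><channel><title>Example</title><ttl>30</ttl></channel></rss>'
print(poll_interval_minutes(feed))  # 30
```

The scheduler would then wait at least that many minutes before fetching the same feed again.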

Sruly
A: 

Well, I'm going to go out on a limb, ignoring the posts that say "Google says, we do", and say: as often as you realistically need to.

RSS is there to keep you up to date. If a feed publishes 10 items an hour but only shows five at a time, and you poll once an hour, you'll miss five of those items and the feed isn't serving its purpose. You might as well not hit it at all.

Of course, you can't hammer the server with requests, but if they're publishing enough to have you requesting once a minute, I don't see how it's unreasonable to match that rate.
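The arithmetic behind this can be sketched as follows: if a feed shows only its newest N items and publishes at a roughly steady rate, the longest gap between polls that still catches every item is N divided by that rate. The function name and the steady-rate assumption are mine, for illustration only.

```python
def max_poll_interval_minutes(items_per_hour: float, items_in_feed: int) -> float:
    """Longest interval (in minutes) between polls that still sees every item,
    assuming items are published at a steady rate."""
    if items_per_hour <= 0:
        raise ValueError("items_per_hour must be positive")
    return 60.0 * items_in_feed / items_per_hour


# The example from this answer: 10 items an hour, 5 visible in the feed.
print(max_poll_interval_minutes(10, 5))  # 30.0
```

So for that feed, polling hourly loses items, while polling every 30 minutes or less does not.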

Oli
You'll note that the Google reference also points out that they use a higher rate for frequently updated feeds.
Jonathan Fingland
My point (which, I'll agree, wasn't put across well, considering I didn't read the quote through) is that Google isn't necessarily the be-all and end-all of best practices or ethics.
Oli
+12  A: 
  1. Make use of HTTP caching. Store the ETag and Last-Modified values the server returns, send them back as If-None-Match and If-Modified-Since headers on the next request, and recognize the 304 Not Modified response. This way you can save a lot of bandwidth. Additionally, some scripts recognize the If-Modified-Since header and return only partial contents (i.e. only the two or three newest items instead of all 30 or so).

  2. Don’t poll RSS from services that support RPC ping (or another push mechanism, such as PubSubHubbub). That is, if you’re receiving push notifications from a service, you don’t have to poll at the standard interval; poll once a day just to check whether the mechanism still works (ping can be disabled, reconfigured, broken, etc.). This way you fetch the feed only on receiving a notification, not every hour or so.

  3. Check the TTL (the ttl element in RSS) or the HTTP cache-control headers such as Expires (for Atom), and don’t fetch until the resource expires.

  4. Try to adapt to the frequency of new items in each individual feed. If there were only two updates to a particular feed in the past week, don’t fetch it more than once a day. AFAIR Google Reader does that.

  5. Lower the rate during night hours or at other times when traffic on your site is low.

  6. Lastly, do it once an hour. ;)
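Points 1 and 4 can be sketched roughly as below, using only the Python standard library. The function names, the 30-second timeout, and the 1-hour and 24-hour bounds are assumptions made for illustration; a real poller would also want retries and error handling.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=30):
    """Return (body, etag, last_modified); body is None when the server says 304."""
    request = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError.
        if err.code == 304:
            return None, etag, last_modified
        raise


def adaptive_interval_hours(updates_last_week, min_hours=1.0, max_hours=24.0):
    """Poll about as often as the feed updated last week, clamped to the bounds."""
    if updates_last_week <= 0:
        return max_hours
    average_gap_hours = 7 * 24 / updates_last_week
    return max(min_hours, min(max_hours, average_gap_hours))


print(adaptive_interval_hours(2))  # 24.0 (two updates a week: once a day is enough)
```

A caller would keep the returned `etag` and `last_modified` per feed and pass them to the next `fetch_if_changed` call; a `None` body means nothing changed and no parsing is needed.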

Maciej Łebkowski
+1 Some excellent points.
altCognito
#2 isn't necessarily a good idea. The site publishing the RSS feed would have to be configured to ping the feed fetcher for it to work.
ceejayoz
yes, ceejayoz, i meant exactly that. edited my answer a little
Maciej Łebkowski
Technorati link broken. The new one (http://technorati.com/ping) says they no longer accept pings because they suck.
dfrankow
thanks @dfrankow — changed the link and added some info on PUSH notifications in general
Maciej Łebkowski
A: 

This is not a complete answer, but look for push alerts.

The RSS blog indicates that a best practice is asking weblogs.com about changed blogs.

There is also some, er, hubbub, about pubsub, a way to subscribe to push alerts that has some momentum.

dfrankow