Hello guys, I wonder how web applications like Google Reader, Bloglines, and Technorati work, and what techniques they use to parse millions of RSS feeds at once with cron jobs?

Good evening, Thank you.

+3  A: 

There are a lot of different techniques... the "worst" one being the one you describe (time-based polling).

The first thing you need to consider is that they may not all do the parsing on the server side. For example, I know that Netvibes was doing the parsing on the client side (but cached the content on the server), which saved them a lot of resources. This way they would poll feeds only when users asked for them, so there was no need for them to run some kind of time loop.
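To make the "poll only when a user asks" idea concrete, here is a minimal sketch of a server-side cache that fetches a feed on demand. It is not Netvibes' actual code: the class name `OnDemandFeedCache`, the `fetch` callable, and the 10-minute `ttl` are all assumptions made for illustration.

```python
import time


class OnDemandFeedCache:
    """Fetch a feed only when somebody asks for it, then cache the result.

    `fetch` is a hypothetical callable taking a URL and returning the raw
    feed content; `ttl` (seconds) is an arbitrary freshness window.
    """

    def __init__(self, fetch, ttl=600.0, clock=time.time):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # url -> (fetched_at, content)

    def get(self, url):
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            # Still fresh: serve the cached copy, no network request.
            return entry[1]
        # Missing or stale: fetch once and remember when we did.
        content = self._fetch(url)
        self._entries[url] = (now, content)
        return content
```

A feed nobody reads is never fetched, and a popular feed is fetched at most once per `ttl` no matter how many users request it.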

Time-based polling is still, unfortunately, the most common solution. There are a lot of techniques to determine the best time to poll: based on the frequency of past updates, on the number of users who subscribed, etc. The old XML-RPC ping servers can also be used by these guys.
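As a rough illustration of choosing a poll time from past update frequency and subscriber count, here is a small sketch. The formula (half the average gap between updates, shortened logarithmically for popular feeds) and the names `next_poll_interval`, `min_interval` and `max_interval` are my own assumptions, not what any particular reader actually uses.

```python
import math
from statistics import mean


def next_poll_interval(update_times, subscribers=1,
                       min_interval=300.0, max_interval=86400.0):
    """Return how many seconds to wait before polling a feed again.

    update_times: timestamps (in seconds) of past observed updates,
                  oldest first.
    subscribers:  number of users following the feed (hypothetical weight).
    """
    if len(update_times) < 2:
        # Not enough history yet: fall back to the slowest schedule.
        return max_interval
    gaps = [later - earlier
            for earlier, later in zip(update_times, update_times[1:])]
    # Poll at roughly half the average gap between updates...
    interval = mean(gaps) / 2
    # ...and somewhat more often when many users are waiting for the feed.
    interval /= 1 + math.log10(max(subscribers, 1))
    # Never poll faster than min_interval or slower than max_interval.
    return max(min_interval, min(max_interval, interval))
```

For example, a feed that has updated hourly would be polled about every 30 minutes with one subscriber, and about every 15 minutes with ten.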

The most efficient technique is to use PubSubHubbub, an open protocol used by Google Reader, Netvibes and a few thousand other apps (like Digg.com, Twitterfeed, FriendFeed...). It allows the feed publisher to push the content of the feed directly to subscribing applications. It's very efficient, but requires the publisher to implement it. Luckily, all the big blogging platforms (Tumblr, Posterous, WordPress, Blogger, Six Apart, etc.) have implemented it. Other feed publishing apps (like FeedBurner, Gowalla...) have implemented it too. If you publish feeds, I would encourage you to join this crowd, and if you plan on consuming some, please implement the subscriber side as well.
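For the subscriber side, the core of the protocol is small: you POST a form-encoded subscription request to the hub, and the hub then calls your callback URL with a `hub.challenge` that you must echo back to confirm the subscription. The sketch below only builds the request body and the verification response, with no HTTP server or client code. The function names and the `expected_topics` set are assumptions for illustration, and older hubs may expect extra parameters (such as `hub.verify`) depending on the spec version they implement.

```python
from urllib.parse import parse_qs, urlencode


def build_subscribe_body(topic_url, callback_url):
    """Form-encoded body to POST to the hub to subscribe to a feed."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed URL you want pushed to you
        "hub.callback": callback_url,  # where the hub will send new content
    })


def answer_verification(query_string, expected_topics):
    """Answer the hub's verification GET on the callback URL.

    Returns (status_code, body). Echoing the challenge with a 2xx status
    confirms the request; anything else rejects it.
    """
    params = parse_qs(query_string)
    mode = params.get("hub.mode", [""])[0]
    topic = params.get("hub.topic", [""])[0]
    challenge = params.get("hub.challenge", [""])[0]
    if (mode in ("subscribe", "unsubscribe")
            and topic in expected_topics and challenge):
        return 200, challenge
    # Unknown topic or malformed request: refuse to confirm.
    return 404, ""
```

Once the subscription is verified, the hub POSTs new entries to the callback as they are published, so no polling loop is needed for that feed.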

The last solution is to have a 3rd party application do this data gathering (using all the techniques above) and ping you when these feeds actually have new content. I created one, Superfeedr, and I believe we do a good job with this. We also normalize the content and do a few other things to help you consume feed data in the simplest and cheapest way (polling can be crazy expensive). Also, we use the exact same PubSubHubbub protocol to push content from any feed, which makes it very simple for our users to use our service in addition to subscribing to available hubs.

Also, I should add that I was able to reply quickly to your question, because I use an app that pushes me the content of the feed for questions tagged RSS :)

Julien Genestoux
Nice to read your answer, Julien, thank you so much ;)