I'm using an RSS library to parse Atom and RSS feeds in Ruby on Rails and store them in a model. I've looked at the standard RSS library, but is there a library that will auto-detect that there is a new RSS feed so I can update my database?

What is the best practice for triggering the code that stores the new RSS feed? Should I use threads to handle this? Is it going to be slow? Thank you for your help.

+1  A: 

I am not sure what you mean by "auto-detect" a new feed.

Are you looking for code that can discover when someone creates a new feed on a site? Or, do you mean discover when an existing feed has a new article?

The first is tough because your code needs to know which sites to look at, so it needs some way to discover sites with new feeds. Searching Google for "new rss feeds" doesn't return anything that looks useful, at least not on the first page. If you, or your users, know of a new site, then you can provide an interface for adding sites to search. Then you grab the page at that URL, look for the RSS/Atom auto-discovery links, and go from there. Auto-discovery links can open a can of worms: the same content may be served in several formats (RDF, RSS and Atom), so you have to determine which to use, or the page may list multiple feeds with different content.
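
The "look for the auto-discovery links" step might be sketched roughly as follows. This is a naive, regex-based illustration using only core Ruby; a real HTML parser would be more robust. The names `FEED_TYPES`, `tag_attribute`, `discover_feeds` and `preferred_feed` are made up for this example, the page HTML is assumed to be fetched already, and preferring Atom over RSS is only one possible way to resolve duplicates.

```ruby
# MIME types that conventionally mark a feed in <link rel="alternate"> tags,
# listed in an assumed order of preference.
FEED_TYPES = %w[application/atom+xml application/rss+xml].freeze

# Pull one attribute value out of a raw tag string (quoted values only).
def tag_attribute(tag, name)
  tag[/\b#{Regexp.escape(name)}\s*=\s*["']([^"']*)["']/i, 1]
end

# Return [{ type:, href: }, ...] for every feed link found in the HTML.
def discover_feeds(html)
  html.scan(/<link\b[^>]*>/i).map do |tag|
    rel  = tag_attribute(tag, 'rel').to_s.downcase.split
    type = tag_attribute(tag, 'type').to_s.downcase
    href = tag_attribute(tag, 'href')
    next unless rel.include?('alternate') && FEED_TYPES.include?(type) && href
    { type: type, href: href }
  end.compact
end

# When a page offers the same content in several formats, pick one.
def preferred_feed(feeds)
  feeds.min_by { |feed| FEED_TYPES.index(feed[:type]) }
end
```

Relative `href` values still need to be resolved against the page URL before they can be fetched.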

If you mean you want to discover when an existing feed has new articles, then you have to keep track of the last time your code looked at the feed and the last article it saw, then retrieve the feed and check whether any articles are missing from your list of previously seen articles. Your code needs to respect the time-to-live information that a lot of feeds carry, too. Hitting a feed every fifteen minutes when it updates once a week is bad form. Most aggregation code can do those things already, but you might need to configure a database and tell the code how to find it.
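
As a minimal sketch of the "previously seen articles" check, assuming each entry carries a stable identifier such as its GUID or link: the `Entry` struct and `new_entries` method are hypothetical names, and in a real app the seen set would live in the database rather than in memory.

```ruby
require 'set'

# Hypothetical stand-in for a parsed feed entry.
Entry = Struct.new(:guid, :title)

# Return the entries whose guid has not been seen before, and record
# their guids in seen_guids so the next run skips them.
def new_entries(entries, seen_guids)
  # Set#add? returns nil when the element was already present.
  entries.select { |entry| seen_guids.add?(entry.guid) }
end
```

Calling `new_entries` twice with the same entries returns them the first time and an empty array the second time.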

Generally, for this sort of task I set up a crontab entry on a production Linux or Unix system and fire off the job periodically, looking in the database for feeds whose last-run-time plus the stored time-to-live value is in the past.
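
The selection that the cron job performs could look something like this sketch. The `Feed` struct and its fields (`last_run_at`, `ttl_minutes`) are assumptions standing in for whatever columns your feeds table actually has; the TTL is taken to be in minutes, as in the RSS 2.0 `ttl` element.

```ruby
# Hypothetical stand-in for a row in the feeds table.
Feed = Struct.new(:url, :last_run_at, :ttl_minutes)

# A feed is due when it has never been fetched, or when its
# last run time plus its time-to-live is in the past.
def due?(feed, now = Time.now)
  feed.last_run_at.nil? || feed.last_run_at + feed.ttl_minutes * 60 <= now
end

# Pick the feeds the periodic job should fetch on this run.
def due_feeds(feeds, now = Time.now)
  feeds.select { |feed| due?(feed, now) }
end
```

With a real database you would push the same comparison into the query instead of loading every feed.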

Does that help any?

Greg
+4  A: 

OK, here's the deal.

  1. If you want a really fast feed parser, go for Feedzirra. It does not work on Windows. http://github.com/pauldix/feedzirra

  2. Autodiscovery?

    - There's truffle-hog if you don't want to do GET redirects. http://github.com/pauldix/truffle-hog

    - There's feedbag if you want to do GET redirects to find feeds from given URLs. This is slower, though. http://github.com/damog/feedbag

  3. Feedzirra is the best bet if you want to poll your feeds for new entries. If you want a non-polling solution, I would suggest going through the pubsubhubbub spec. While parsing your feeds, check whether they are pubsubhubbub enabled by looking for the hub link tag. If it points to pubsubhubbub.appspot.com or any other pubsub-enabled hub, subscribe to the feed by sending a subscription request to the hub. You can then define an endpoint in your app that receives updated-entry pings for your subscription from the hub. Just read the raw POST data and store it in your database. Reportedly, 95% of Blogger blogs are pubsub enabled. That is a lot of data in your hands already. :)

  4. If you are polling for changes, check the Last-Modified or ETag response header rather than parsing the entire feed again. This saves you from wasting resources. Feedzirra takes care of this for you.
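
If you are not using Feedzirra, the header check in point 4 can be approximated with a conditional GET from the standard library. This is only a sketch: `conditional_headers` and `fetch_if_changed` are names invented here, and it assumes you stored the `ETag` and `Last-Modified` values from the previous response. Not every server honors these headers, so treat a full response as "possibly changed".

```ruby
require 'net/http'
require 'uri'

# Build the validators to send, from whatever was stored on the last fetch.
def conditional_headers(etag: nil, last_modified: nil)
  headers = {}
  headers['If-None-Match'] = etag if etag
  headers['If-Modified-Since'] = last_modified if last_modified
  headers
end

# Returns nil when the server answers 304 Not Modified,
# otherwise the full response (whose body is the feed).
def fetch_if_changed(url, etag: nil, last_modified: nil)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  conditional_headers(etag: etag, last_modified: last_modified).each do |name, value|
    request[name] = value
  end
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPNotModified) ? nil : response
end
```

After a successful fetch, store `response['ETag']` and `response['Last-Modified']` alongside the feed so they can be passed in on the next poll.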

Shripad K
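
For the pubsubhubbub route in point 3, the subscription request is a form-encoded POST to the hub. A rough sketch, with the caveats that `subscription_params` and `subscribe` are made-up helper names, the URLs are placeholders, and `hub.verify` belongs to the 0.3 version of the spec as far as I remember, so check which version your hub implements:

```ruby
require 'net/http'
require 'uri'

# Form fields for a subscription request.
# hub.verify is assumed here for spec 0.3; later versions may not want it.
def subscription_params(topic_url, callback_url)
  {
    'hub.mode'     => 'subscribe',
    'hub.topic'    => topic_url,    # the feed you want updates for
    'hub.callback' => callback_url, # the endpoint in your app
    'hub.verify'   => 'async'
  }
end

# Send the subscription request to the hub and return its response.
def subscribe(hub_url, topic_url, callback_url)
  Net::HTTP.post_form(URI(hub_url), subscription_params(topic_url, callback_url))
end
```

The hub will then call your callback URL to verify the subscription before it starts POSTing updated entries there, so that endpoint has to exist before you subscribe.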
I forgot about doing a head() on the URL and looking for the etag and last-modified headers. I had to write an aggregator about two years ago and was going from my highly damaged memory. +1 for your response!
Greg
Thank you Greg :)
Shripad K
A: 

A very easy solution is to use dynamic attribute-based finders.

When you are filling your model with RSS feed data, instead of Model.create(...) use Model.find_or_create_by_column(value, :other_column => other_value).

You can use a date or the RSS item title as the unique value (whatever you want).

I think this is pretty easy. You can set up a cron task to fill your model once per hour, for example. Only new items will be added.

There is no way to get an "event" when an RSS feed is updated without downloading the whole feed again.

retro
No, you are wrong. You can fetch just the headers of the RSS feed rather than downloading the entire feed. The headers contain the ETag or Last-Modified value, which you can compare with the one already stored in your database. The entire feed needs to be downloaded only if there has been an update.
Shripad K
And you can get an "event" when the RSS is updated by subscribing to a pubsub-enabled server. Read my first answer. You receive the updated feed entry in the form of fat pings. Then you can just read the raw POST data and extract the content. Read up on the pubsubhubbub spec.
Shripad K
We are talking about a simple, general RSS feed, not about a pubsub-enabled server! Your solution is not a general RSS feed solution.
retro