Let's say I have a database and an RSS feed. I need to find out which items in the RSS feed are new, that is, not already in the database. How would you approach this problem?

A: 

Pull a unique field from each item in the RSS feed, then check whether that item is already in the db. Run this logic in a loop over the feed's items.
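
A rough sketch of that loop, assuming the unique field is the item's link and that the database is SQLite with a hypothetical `seen_items` table holding a `link` column (neither is specified in the question):

```python
import sqlite3
import xml.etree.ElementTree as ET


def find_new_items(rss_xml, conn):
    """Return feed items whose link is not yet in the seen_items table.

    Assumes a table like: CREATE TABLE seen_items (link TEXT PRIMARY KEY)
    """
    root = ET.fromstring(rss_xml)
    new_items = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link is None:
            continue  # nothing to compare on, skip this item
        row = conn.execute(
            "SELECT 1 FROM seen_items WHERE link = ?", (link,)
        ).fetchone()
        if row is None:
            new_items.append({"title": item.findtext("title"), "link": link})
    return new_items
```

For a large feed, one query per item could be replaced by loading the known links into a set first; the per-item query just mirrors the description above.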

rramirez
+1  A: 

Most RSS feeds will have a date with each story, so query the database for the date of the latest story it holds, pull the latest stories from the RSS feed, and compare dates: anything in the feed dated after the latest stored story is new.

It also depends on whether this is for one particular feed or if you are writing something that will work for many feeds. If it's supposed to work for all feeds, use one of the hashing methods; create a hash of the title and date and use this as a unique identifier.
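
The date comparison might look something like this. It assumes each item carries a `pubDate` in the usual RFC 822 style with an explicit timezone, and that `latest_stored` (a name made up for this sketch) is a timezone-aware datetime you got from your own database query:

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def items_newer_than(rss_xml, latest_stored):
    """Return (title, published) pairs dated after latest_stored.

    latest_stored must be timezone-aware; comparing it with a naive
    datetime would raise TypeError.
    """
    root = ET.fromstring(rss_xml)
    newer = []
    for item in root.iter("item"):
        raw = item.findtext("pubDate")
        if not raw:
            continue  # no date on this item, so it cannot be compared
        published = parsedate_to_datetime(raw)
        if published > latest_stored:
            newer.append((item.findtext("title"), published))
    return newer
```

Items without a date are skipped here, which is one reason the hashing approach is the safer bet across many feeds.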

Russ Frank
+2  A: 

How about generating a hashcode or some other unique identifier for each RSS item and storing it in the database? Then you just generate the hashcode for each item in the newly fetched feed and check it against the database.
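
As an illustration only: the sketch below hashes the title and date with SHA-256 (the choice of fields and of hash function is an assumption, not something fixed by the answer) and uses a plain set, `seen_hashes`, as a stand-in for the database column of stored hashes.

```python
import hashlib


def item_fingerprint(title, pub_date):
    """Return a SHA-256 hex digest built from an item's title and date."""
    # The unit-separator character keeps ("ab", "c") distinct from ("a", "bc").
    payload = f"{title.strip()}\x1f{pub_date.strip()}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def filter_unseen(items, seen_hashes):
    """Return the items whose fingerprint is not in seen_hashes.

    items is a list of (title, pub_date) string pairs. Fingerprints of
    returned items are added to seen_hashes so a second pass skips them.
    """
    unseen = []
    for title, pub_date in items:
        digest = item_fingerprint(title, pub_date)
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unseen.append((title, pub_date))
    return unseen
```

In a real database you would probably put a unique index on the hash column so the lookup stays cheap.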

Miki Watts
You just beat me to it :)
David Robbins
A: 

Offhand, a few suggestions:

  • Compute a checksum of each item in the feed and store the result in the database. Compare the stored checksums with those of each new file / stream from the RSS source.
  • Hash the title, date and time for each item and store the hash in the database. Compare with each refreshed RSS stream.
David Robbins
+2  A: 

First you have to uniquely identify each item. This is problematic because some sites use the guid element and some sites don't, and for some items the link element never changes and for some it does. I think that the general rule of thumb is that if an item has a guid you use that as the key, otherwise you use the link as the key and hope.
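
That rule of thumb is short enough to sketch directly. `item_key` is a hypothetical helper name, and the code assumes plain RSS 2.0 `item` elements parsed with the standard library:

```python
import xml.etree.ElementTree as ET


def item_key(item):
    """Return the item's guid if it has one, else its link, else None."""
    guid = (item.findtext("guid") or "").strip()
    if guid:
        return guid
    link = (item.findtext("link") or "").strip()
    return link or None


def keys_for_feed(rss_xml):
    """Return the key of every item in the feed, in document order."""
    root = ET.fromstring(rss_xml)
    return [item_key(item) for item in root.iter("item")]
```

Items that have neither element come back as `None`, so the caller has to decide what to do with them.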

Once you've established the key for an item, you can (probably) determine whether the item you're looking at has been updated by examining the pubDate element, which ought to be updated if the story gets updated.
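
Roughly, the update check could go like this. The `stored` dict (key to last-seen `pubDate` string) stands in for whatever your database holds, and the three labels are made up for the sketch:

```python
from email.utils import parsedate_to_datetime


def classify(key, pub_date, stored):
    """Label an item as "new", "updated" or "unchanged".

    stored maps item key -> the pubDate string saved for that key.
    Both dates are assumed to be in RFC 822 style with a timezone.
    """
    if key not in stored:
        return "new"
    if parsedate_to_datetime(pub_date) > parsedate_to_datetime(stored[key]):
        return "updated"
    return "unchanged"
```

Parsing the dates rather than comparing the raw strings avoids false "updated" results when a feed only changes how it formats the same instant.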

This approach will handle most cases, though as with everything related to RSS it breaks down if the feed provider isn't behaving properly.

Robert Rossney