My application needs to keep track of RSS/Atom feeds and save new entries in a database. What is the most reliable way to determine whether an entry in a feed has already been crawled? I use the Universal Feed Parser module to parse the feeds. My current implementation records the latest value of feed.entries[i].updated_parsed; when crawling, an entry is saved to the database if its updated_parsed value is greater than the recorded value. The problem is that many feeds don't have a published or updated date.
A:
You should determine whether you've already crawled an entry primarily by its <guid> (falling back to <link> when there is no <guid>), and use dates only as a secondary check.
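As a rough sketch of that lookup order: Universal Feed Parser exposes an RSS <guid> (and an Atom <id>) as the entry's `id` key, and the entry URL as `link`. The helper names `entry_key` and `new_entries`, and the in-memory `seen` set standing in for the database, are made up for illustration. Skipping entries that have neither field is just one possible policy.

```python
from typing import Iterable, List, Mapping, Optional, Set


def entry_key(entry: Mapping) -> Optional[str]:
    """Return a stable identifier for an entry, or None if it has neither field."""
    # feedparser maps RSS <guid> / Atom <id> to "id"; fall back to "link".
    return entry.get("id") or entry.get("link")


def new_entries(entries: Iterable[Mapping], seen: Set[str]) -> List[Mapping]:
    """Return entries whose key is not yet in `seen`, recording each new key."""
    fresh = []
    for entry in entries:
        key = entry_key(entry)
        if key is None:
            # Assumption: entries with no guid and no link are skipped here.
            continue
        if key in seen:
            continue
        seen.add(key)
        fresh.append(entry)
    return fresh
```

With real feeds you would pass `feedparser.parse(url).entries` as `entries`, and `seen` would more likely be a database column with a unique index, scoped per feed, than a Python set.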
chaos
2009-03-28 05:25:46