How to crawl a feed | ansaurus

+1 Q:

How to crawl a feed

My application needs to keep track of RSS/Atom feeds and save new entries in a database. What is the most reliable way to determine whether an entry in a feed has already been crawled? I use the Universal Feed Parser module to parse the feeds. My current implementation records the latest value of feed.entries[i].updated_parsed; when crawling, an entry is saved to the database if its updated_parsed value is greater than the recorded value. The problem is that many feeds don't have a published date or an updated date.

+1 A:

You should determine whether you've already crawled an entry primarily by its <guid> (falling back to <link> when there is no <guid>), and use anything to do with dates only as a secondary check.

chaos 2009-03-28 05:25:46
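
The approach in the answer could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes entries are dict-like, as Universal Feed Parser entries are, and that the parser exposes an RSS <guid> (or an Atom <id>) under the "id" key. The names entry_key, new_entries and seen_keys are hypothetical, and the content-hash fallback for entries that have neither a <guid> nor a <link> is an added assumption that the answer does not mention.

```python
import hashlib


def entry_key(entry):
    """Return a stable identifier for a parsed feed entry.

    Prefers the entry's id (RSS <guid> / Atom <id>), then its link.
    """
    key = entry.get("id") or entry.get("link")
    if key:
        return key
    # Assumed last resort, not part of the original answer:
    # hash the title and summary so identical content maps to one key.
    text = (entry.get("title") or "") + "\n" + (entry.get("summary") or "")
    return "sha1:" + hashlib.sha1(text.encode("utf-8")).hexdigest()


def new_entries(entries, seen_keys):
    """Return entries whose key is not yet in seen_keys.

    Each new key is added to seen_keys, so a second pass over the
    same entries returns nothing.
    """
    fresh = []
    for entry in entries:
        key = entry_key(entry)
        if key not in seen_keys:
            seen_keys.add(key)
            fresh.append(entry)
    return fresh
```

With Universal Feed Parser this would be called as new_entries(feedparser.parse(url).entries, seen_keys). In a real crawler the in-memory set would probably be replaced by a database table keyed on the feed and the entry key, since identifiers are only expected to be unique within a single feed.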
