Guys,

I have a list of URLs to process, and the result should contain only the RSS feed URLs from that list.

How can I identify whether a given link is an RSS feed URL or not?

I need to build the program in Java and, for your information, I am a beginner in Java.

Please advise me briefly on the same. Thanks in advance.

+1  A: 

There are a few things you can try, off of the top of my head:

  1. See what Content-Type the server returns for the given URL. However, this may not be definitive and a server may not necessarily return the correct header.
  2. Try to parse the content of the URL as RSS and see if it is successful - this is likely the only definitive proof that a given URL is an RSS feed.
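
The first check could be sketched in Java roughly like this. The class and method names are made up for illustration, and the list of media types is my own assumption about what servers commonly send for feeds, so treat a match only as a hint (generic XML types in particular say nothing about RSS specifically).

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: a sketch of the Content-Type check, not a definitive test.
class ContentTypeCheck {

    // Returns true when the header value names a media type that feeds are often served with.
    static boolean mightBeFeed(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=utf-8" before comparing.
        int semi = contentType.indexOf(';');
        String mediaType = (semi >= 0 ? contentType.substring(0, semi) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return mediaType.equals("application/rss+xml")
                || mediaType.equals("application/atom+xml")
                || mediaType.equals("application/xml") // generic XML: weak hint only
                || mediaType.equals("text/xml");       // generic XML: weak hint only
    }

    // Asks the server for headers only (HEAD) and returns the Content-Type, or null if absent.
    static String fetchContentType(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5000); // timeouts are arbitrary choices
        conn.setReadTimeout(5000);
        try {
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }
}
```

You would call `ContentTypeCheck.mightBeFeed(ContentTypeCheck.fetchContentType(url))` for each URL and then still parse the candidates to confirm.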
matt b
+1  A: 

Given just the URL, there's no way to be 100% sure. RSS files are normally .xml, but are not (as far as I can tell) required to have that suffix. If you just categorized based on ".xml" or not, you'd have a lot of mistakes - classifying lots of non-RSS files as RSS and some that are RSS files as non-RSS.

To really be sure, you need to actually fetch the file at each URL and parse it. You should probably find a library to do this, because parsing it yourself is probably a nightmare. This library (http://www.davidpashley.com/projects/eddie.html) looks reasonable. You could load each URL's contents, hand them to the library, and if the library parses them successfully, mark the URL as an RSS or Atom feed. You may have false negatives, but they'll be far less frequent than if you tried to categorize based on the URL alone.
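
As a rough sketch of the "fetch and parse" idea, here is a version that uses the XML parser built into the JDK instead of a feed library, and only looks at the root element. The class name is hypothetical. It assumes a root of `rss` means RSS and `feed` means Atom; RSS 1.0 (root `rdf:RDF`) is not handled, and refusing DOCTYPE declarations is a safety choice that can cause false negatives on older feeds.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Hypothetical helper: parses the content as XML and inspects the root element.
class FeedRootCheck {

    // Returns true if the stream is well-formed XML whose root element is <rss> or <feed>.
    static boolean isFeed(InputStream in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so getLocalName() is populated
            // Reject DOCTYPEs to avoid external-entity tricks in untrusted content.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Element root = builder.parse(in).getDocumentElement();
            String name = root.getLocalName();
            return "rss".equals(name) || "feed".equals(name);
        } catch (Exception e) {
            // Not well-formed XML (or rejected by the parser settings): treat as "not a feed".
            return false;
        }
    }

    // Convenience overload for content that is already in memory as text.
    static boolean isFeed(String xml) {
        return isFeed(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```

For a real URL you could pass `URI.create(url).toURL().openStream()` to `isFeed`. A proper feed library would validate much more than the root element, so this is only a first filter.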

If all you care about is RSS and performance is an issue (i.e., you don't want to boot up a SAX parser for each file), you could read up on the RSS specification (http://cyber.law.harvard.edu/rss/rss.html) and just do some simple string searching for files that look broadly like they might be RSS files. You'll have more false positives (and probably some false negatives), but it'll be faster. It all depends on how much time you want to spend on this and how sure you need to be. But to have any accuracy at all, you'll need to download each file to check it.
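
A minimal sketch of the string-searching shortcut, assuming it is enough to read the first chunk of each file and look for an opening `<rss` tag (the RSS specification puts an `<rss>` element at the top level). The class name and the idea of sampling only the head of the file are my own choices, and this will be fooled by any page that merely mentions `<rss ` near the top.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

// Hypothetical helper: cheap textual sniffing, no XML parsing involved.
class FeedSniffer {

    // "<rss" followed by whitespace or ">" so that e.g. "<rssfoo>" does not match.
    private static final Pattern RSS_OPEN_TAG = Pattern.compile("<rss[\\s>]");

    // Returns true if the sampled text contains something that looks like an <rss> start tag.
    static boolean looksLikeRss(String head) {
        return RSS_OPEN_TAG.matcher(head).find();
    }

    // Reads at most maxChars characters from the reader and returns them as a string.
    static String readHead(Reader reader, int maxChars) {
        try {
            char[] buf = new char[maxChars];
            int total = 0;
            while (total < maxChars) {
                int n = reader.read(buf, total, maxChars - total);
                if (n < 0) {
                    break; // end of stream
                }
                total += n;
            }
            return new String(buf, 0, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Something like `FeedSniffer.looksLikeRss(FeedSniffer.readHead(reader, 2048))` per file would do; the 2048 is an arbitrary sample size.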

drewww
A: 

An RSS document is an XML file. The format of the XML file is given in the RSS Specification. You can use XML parsers in Java to read and create RSS feeds.
Here is a tutorial that might help: RSS feeds with Java.
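
To give an idea of what reading a feed with a plain XML parser looks like, here is a small sketch using the DOM parser that ships with the JDK. It assumes an RSS 2.0 style layout where each entry is an `<item>` with a `<title>` child; the class and method names are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the titles of all <item> elements in an RSS document.
class RssTitles {

    static List<String> itemTitles(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> titles = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Only titles nested inside this <item>; the channel's own title is skipped.
                NodeList titleNodes = item.getElementsByTagName("title");
                if (titleNodes.getLength() > 0) {
                    titles.add(titleNodes.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }
}
```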

Zaki