ansaurus

Question

How to detect if a page is an RSS or ATOM feed

Answer 1

+1 A:

Why not try to parse your data with a component built specifically to parse RSS/Atom feeds, such as Zend_Feed_Reader?

With that, if the parsing succeeds, you'll be pretty sure that the URL you used is indeed a valid RSS/ATOM feed.

And I should add that you could use such a component to parse feeds and extract their information, too: no need to reinvent the wheel by parsing the XML "by hand" and dealing with special cases yourself.
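As a rough illustration of the "if the parsing succeeds, it is a feed" idea, here is a minimal sketch in Python. It uses the standard-library xml.etree.ElementTree as a stand-in for a real feed library, so it only checks well-formedness and the root element. The function name feed_type_by_parsing is hypothetical and not part of any library.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Atom 1.0 and by the RDF-based RSS variants.
ATOM_NS = "http://www.w3.org/2005/Atom"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def feed_type_by_parsing(text):
    """Return 'atom', 'rss', or None when text does not parse as a feed.

    Hypothetical helper: a strict XML parse followed by a root-element check.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML: typical for real-world HTML and for broken feeds.
        return None
    if root.tag == "{%s}feed" % ATOM_NS:
        return "atom"
    if root.tag == "rss" or root.tag == "{%s}RDF" % RDF_NS:
        return "rss"
    # Well-formed XML, but not a feed root (for example, XHTML).
    return None
```

A strict parser like this rejects badly formed feeds outright, which is one reason a dedicated feed library with more forgiving parsing may do better in practice.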

Pascal MARTIN 2010-03-14 17:16:18

Using simplexml_load_string and parsing by hand is working for me; it's detecting the difference between a website and a feed that's the issue. Thanks though ;)

Pepper 2010-03-14 17:28:26

What if the feed is badly formed XML? Are you able to parse all of the extensions to feeds like tags and enclosures? Maybe you don't care about these things, but my experience is that feeds are not as standardized as you might expect and using an existing library will keep you from reinventing the wheel.

Jackson Miller 2010-03-14 17:51:06

I'll give Zend_Feed_Reader a try. I tried SimplePie early in the project and I had a higher success rate parsing it myself. You're right about feeds not being standardized; it's a mess out there.

Pepper 2010-03-14 17:55:28

Answer 2

+3 A:

I would sniff for the various unique identifiers those formats have:

Atom: Source

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

RSS 0.90: Source

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">

Netscape RSS 0.91

<rss version="0.91">

etc. etc. (See the 2nd source link for a full overview).

As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed> and <rss> tags, respectively (or <rdf:RDF> for the RDF-based RSS 0.90 shown above). Plus, you won't find those in a valid HTML document.

You could make an initial check to tell HTML and feeds apart by looking for <html> and <body> elements first. To avoid problems with invalid input, this may be a case where using regular expressions (over a parser) is finally justified for once :)

If it doesn't match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.

What that looks like in the wild - whether feed providers always conform to those rules - is a different question, but you should already be able to recognize a lot of feeds this way.
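The order of checks described above (HTML first, then Atom / RSS, then fall back to HTML) could be sketched with regular expressions roughly like this. This is Python, not the poster's code. The function name sniff and the 4096-character window are assumptions made for illustration; the window is there so that HTML embedded deep inside a feed item is less likely to trigger the HTML test.

```python
import re

# Opening tags that hint at each format; matching is case-insensitive.
HTML_RE = re.compile(r"<(html|body)\b", re.IGNORECASE)
ATOM_RE = re.compile(r"<feed\b", re.IGNORECASE)
RSS_RE = re.compile(r"<(rss|rdf:RDF)\b", re.IGNORECASE)


def sniff(text, window=4096):
    """Guess 'html', 'atom', or 'rss' from the start of a document.

    Hypothetical helper; `window` is an assumed cut-off, not a standard value.
    """
    head = text[:window]
    if HTML_RE.search(head):
        return "html"
    if ATOM_RE.search(head):
        return "atom"
    if RSS_RE.search(head):
        return "rss"
    # Nothing recognized: fall back to treating the page as HTML.
    return "html"
```

Because no XML parser is involved, malformed input cannot make this step fail; the cost is that it can be fooled by unusual documents.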

Pekka 2010-03-14 17:18:00

Yep, they are supposed to have those tag identifiers. But there are so many badly formed feeds and different versions out there that I can't rely on it. Looking for the <html> or <body> tag is interesting. I'll test that out.

Pepper 2010-03-14 17:35:43

@Pepper yes, maybe compile lists of tags to sniff for? `html` and `body` for HTML, `rdf` and `item` (IIRC) for RSS, `feed` for Atom....
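One possible way to turn such tag lists into a check is to count how often each list's opening tags occur and pick the best-scoring format. This is only a sketch in Python: the names TAG_HINTS, count_tags and guess_kind are made up for illustration, `rss` is added to the RSS list on top of the tags mentioned above, and the scoring rule (highest count wins) is an assumption rather than an established method.

```python
import re

# Tag lists to sniff for, per format.
TAG_HINTS = {
    "html": ("html", "body"),
    "rss": ("rss", "rdf:RDF", "item"),
    "atom": ("feed",),
}


def count_tags(text, names):
    """Count opening tags in text whose name is one of names."""
    total = 0
    for name in names:
        pattern = r"<%s\b" % re.escape(name)
        total += len(re.findall(pattern, text, re.IGNORECASE))
    return total


def guess_kind(text):
    """Return the format with the most tag hits, or None if nothing matched."""
    scores = {kind: count_tags(text, names) for kind, names in TAG_HINTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Counting instead of requiring a single root tag means a feed with a mangled header can still be recognized from its items.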

Pekka 2010-03-14 17:57:17

Answer 3

A:

Pepper,

Use the Content-Type HTTP response header to dispatch to the right handler.
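A minimal sketch of that dispatch in Python, assuming the header value has already been read from the response. The function name kind_from_content_type and the returned labels are hypothetical; the media types listed are the ones commonly seen for feeds and pages, and generic XML types are reported separately because they do not say which format follows.

```python
# Media types that identify the format directly.
KNOWN_TYPES = {
    "application/rss+xml": "rss",
    "application/atom+xml": "atom",
    "text/html": "html",
    "application/xhtml+xml": "html",
}

# Generic XML types: the body still has to be inspected.
GENERIC_XML = {"application/xml", "text/xml"}


def kind_from_content_type(header_value):
    """Map a Content-Type header value to 'rss', 'atom', 'html', 'xml' or 'unknown'."""
    if not header_value:
        return "unknown"
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = header_value.split(";", 1)[0].strip().lower()
    if media_type in KNOWN_TYPES:
        return KNOWN_TYPES[media_type]
    if media_type in GENERIC_XML:
        return "xml"
    return "unknown"
```

For example, kind_from_content_type("application/atom+xml; charset=utf-8") gives "atom", while "text/xml" gives "xml", which signals that the content itself needs a closer look.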

Jan

Jan Algermissen 2010-03-14 18:24:18

I think his problem goes deeper: he needs to work with many RSS sources, many of which don't even serve valid markup in their chosen format, let alone send the correct Content-Type header.

Pekka 2010-03-14 18:25:53
