Hello,

I'm currently building a new online feed reader in PHP. One of the features I'm working on is feed auto-discovery. If a user enters a website URL, the script will detect that it's not a feed and look for the real feed URL by parsing the HTML for the proper tag.

The problem is that the way I'm currently detecting whether the URL is a feed or a website only works part of the time, and I know it can't be the best solution. Right now I'm taking the cURL response and running it through simplexml_load_string; if it can't be parsed, I treat it as a website. Here is the code:

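libxml_use_internal_errors( true ); // collect parse errors quietly instead of suppressing them with @
$xml = simplexml_load_string( $site_found['content'] );

if( $xml === false ) // could not be parsed as XML: treat it as a website, not a feed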
{
    // handle website
}
else
{
    // parse feed
}

Obviously, this isn't ideal. Also, when it runs into an HTML website that it can parse, it thinks it's a feed.

Any suggestions on a good way of detecting the difference between a feed and a non-feed in PHP?

Thanks,

Pepper http://feedingo.com

+1  A: 

Why not try to parse your data with a component built specifically to parse RSS/Atom feeds, like Zend_Feed_Reader?

With that, if the parsing succeeds, you'll be pretty sure that the URL you used is indeed a valid RSS/Atom feed.


I should add that you could also use such a component to parse the feeds and extract their information: there's no need to reinvent the wheel by parsing the XML "by hand" and dealing with special cases yourself.

Pascal MARTIN
Using simplexml_load_string and parsing by hand is working for me; it's detecting the difference between a website and a feed that's the issue. Thanks though ;)
Pepper
What if the feed is badly formed XML? Are you able to parse all of the extensions to feeds like tags and enclosures? Maybe you don't care about these things, but my experience is that feeds are not as standardized as you might expect and using an existing library will keep you from reinventing the wheel.
Jackson Miller
I'll give Zend_Feed_Reader a try. I tried SimplePie early in the project and I had a higher success rate parsing it myself. You're right about feeds not being standardized, it's a mess out there.
Pepper
+3  A: 

I would sniff for the various unique identifiers those formats have:

Atom: Source

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

RSS 0.90: Source

<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">

Netscape RSS 0.91

<rss version="0.91">

etc. etc. (See the 2nd source link for a full overview).

As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed> and <rss> tags, respectively. Plus you won't find those in a valid HTML document.

You could make an initial check to tell HTML and feeds apart by looking for <html> and <body> elements first. To avoid problems with invalid input, this may be a case where using regular expressions (over a parser) is finally justified for once :)

If it doesn't match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.

What that looks like in the wild (whether feed providers always conform to those rules) is a different question, but you should already be able to recognize a lot of feeds this way.
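
The checks described above could be sketched roughly as follows. This is only an illustration, not a tested solution: the function name sniff_document_type, the 4096-byte window, and the exact patterns are assumptions made for the example, and badly formed feeds in the wild may well slip past it.

```php
<?php
// Hypothetical helper (not from this thread): classify a response body as
// 'html', 'atom', 'rss', or 'unknown' by sniffing for well-known tags.
function sniff_document_type(string $content): string
{
    // Assumption: the identifying tags appear near the start of the document.
    $head = substr($content, 0, 4096);

    // HTML test first: look for an <html> or <body> opening tag.
    if (preg_match('/<(html|body)[\s>]/i', $head)) {
        return 'html';
    }
    // Atom: the root element is <feed>.
    if (preg_match('/<feed[\s>]/i', $head)) {
        return 'atom';
    }
    // RSS 0.91 and later use <rss>; RSS 0.90 uses <rdf:RDF>.
    if (preg_match('/<(rss|rdf:RDF)[\s>]/i', $head)) {
        return 'rss';
    }
    // Nothing recognized: the caller can fall back to treating it as HTML.
    return 'unknown';
}
```

A result of 'unknown' would then be handled like a website, in line with the fallback described above.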

Pekka
Yep, they are supposed to have those tag identifiers. But there are so many badly formed feeds and different versions out there that I can't rely on it. Looking for the <html> or <body> tag is interesting. I'll test that out.
Pepper
@Pepper yes, maybe compile lists of tags to sniff for? `html` and `body` for HTML, `rdf` and `item` (IIRC) for RSS, `feed` for Atom....
Pekka
A: 

Pepper,

Use the Content-Type HTTP response header to dispatch to the right handler.
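
As a minimal sketch of that dispatch, assuming the header value is already available as a string: the function name classify_content_type and the list of MIME types are illustrative choices for this example, and servers that send a generic or wrong type will still need a look at the body.

```php
<?php
// Hypothetical helper (not from this thread): map a Content-Type header
// value to 'feed', 'html', or 'unknown'.
function classify_content_type(string $contentType): string
{
    // Drop parameters such as "; charset=utf-8" and normalize the case.
    $mime = strtolower(trim(explode(';', $contentType)[0]));

    // Assumed list of feed MIME types; extend as needed.
    $feedTypes = ['application/rss+xml', 'application/atom+xml', 'application/rdf+xml'];
    if (in_array($mime, $feedTypes, true)) {
        return 'feed';
    }
    if ($mime === 'text/html' || $mime === 'application/xhtml+xml') {
        return 'html';
    }
    // text/xml, application/xml, or a missing header are ambiguous.
    return 'unknown';
}
```

With cURL, the value could come from curl_getinfo($ch, CURLINFO_CONTENT_TYPE), cast to a string because no string is returned when the server sends no such header.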

Jan

Jan Algermissen
I think his problem goes deeper: he needs to work with many RSS sources, many of which don't even serve valid markup in their chosen format, let alone send the correct Content-Type header.
Pekka