I'm currently using Magpie RSS but it sometimes falls over when the RSS or Atom feed isn't well formed. Are there any other options for parsing RSS and Atom feeds with PHP?
I use SimplePie to parse a Google Reader feed and it works pretty well and has a decent feature set.
Of course, I haven't tested it with non-well-formed RSS / Atom feeds, so I don't know how it copes with those. I'm assuming Google's are fairly standards compliant! :)
The HTML Tidy library is able to fix some malformed XML files. Running your feeds through that before passing them on to the parser may help.
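A minimal sketch of that pre-processing step, assuming PHP's optional tidy extension is installed and the feed is UTF-8. The helper name repairFeedXml is made up for illustration, and Tidy can only repair some kinds of breakage, so treat the result as a best effort:

```php
<?php
// Hypothetical helper: try to repair feed markup with Tidy in XML mode.
// Returns the repaired string, or null if the tidy extension is missing
// or the repair fails.
function repairFeedXml($xml_source)
{
    if (!extension_loaded('tidy'))
        return null;

    $config = array(
        'input-xml'  => true, // parse the input as XML rather than HTML
        'output-xml' => true, // write XML back out
        'wrap'       => 0,    // do not re-wrap long lines
    );

    // Assumption: the feed is UTF-8 encoded.
    $repaired = tidy_repair_string($xml_source, $config, 'utf8');

    return ($repaired === false || $repaired === '') ? null : $repaired;
}

$xml_source = '<rss version="2.0"><channel><title>Example</title></channel></rss>';
$repaired = repairFeedXml($xml_source);

// Fall back to the original source when no repair was possible.
$x = simplexml_load_string($repaired !== null ? $repaired : $xml_source);
```

Whether the cleaned-up output is usable still depends on the parser accepting it, so it is worth checking the parse result afterwards rather than assuming the repair worked.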
I've always used the SimpleXML functions built in to PHP to parse XML documents. It's one of the few generic parsers out there that has an intuitive structure to it, which makes it extremely easy to build a meaningful class for something specific like an RSS feed. Additionally, it will detect XML warnings and errors, and upon finding any you could simply run the source through something like HTML Tidy (as ceejayoz mentioned) to clean it up and attempt it again.
Consider this very rough, simple class using SimpleXML:
<?php
// Plain value object holding one feed entry.
class BlogPost
{
    public $date;
    public $ts;
    public $link;
    public $title;
    public $text;
    public $summary;
}

class BlogFeed
{
    public $posts = array();

    public function __construct($file_or_url)
    {
        // Anything that is not an http(s) URL is treated as a file name
        // under the shared XML directory.
        if (!preg_match('#^https?:#i', $file_or_url))
            $feed_uri = $_SERVER['DOCUMENT_ROOT'] . '/shared/xml/' . $file_or_url;
        else
            $feed_uri = $file_or_url;

        $xml_source = file_get_contents($feed_uri);
        if ($xml_source === false)
            return;

        // simplexml_load_string() returns false if the source is not well-formed.
        $x = simplexml_load_string($xml_source);
        if ($x === false || count($x) == 0)
            return;

        foreach ($x->channel->item as $item)
        {
            $post = new BlogPost();
            $post->date  = (string) $item->pubDate;
            $post->ts    = strtotime((string) $item->pubDate);
            $post->link  = (string) $item->link;
            $post->title = (string) $item->title;
            $post->text  = (string) $item->description;

            // Create summary as a shortened body and remove images, extraneous line breaks, etc.
            $summary = $post->text;
            $summary = preg_replace('#<img[^>]*>#i', '', $summary);
            $summary = preg_replace('#^(<br[ ]?/>)*#i', '', $summary);
            $summary = preg_replace('#(<br[ ]?/>)*$#i', '', $summary);

            // Truncate summary line to 100 characters
            $max_len = 100;
            if (strlen($summary) > $max_len)
                $summary = substr($summary, 0, $max_len) . '...';

            $post->summary = $summary;
            $this->posts[] = $post;
        }
    }
}
?>
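The error detection mentioned above could look roughly like the sketch below. It uses PHP's libxml functions to collect parse errors instead of emitting warnings; the helper name loadFeedXml and its return shape are assumptions made for this example, not part of the class above:

```php
<?php
// Hypothetical helper: parse a feed string and collect libxml errors.
// Returns array($xml_or_null, $error_messages).
function loadFeedXml($xml_source)
{
    // Collect errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $x = simplexml_load_string($xml_source);

    $errors = array();
    foreach (libxml_get_errors() as $error)
        $errors[] = sprintf('line %d: %s', $error->line, trim($error->message));

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return array($x === false ? null : $x, $errors);
}

// A feed with a mismatched closing tag is not well-formed.
list($x, $errors) = loadFeedXml('<rss><channel><title>Broken</channel></rss>');
if ($x === null)
    echo "Feed could not be parsed:\n" . implode("\n", $errors) . "\n";
```

At the point where $x is null you can decide whether to reject the feed outright or to clean the source up and try again.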
If a feed isn't well-formed XML, you're supposed to reject it, no exceptions. You're entitled to call the feed creator a bozo.
Otherwise you're paving the way to the same mess that HTML ended up in.
Personally, I use BNC Advanced Feed Parser. I like its template system, which is very easy to use.