I'm trying to get up to speed on HTML/CSS/PHP development and was wondering how I should validate my code when it contains content I can't control, like an RSS feed?

For example, my home page is a .php doc that contains HTML and PHP code. I use the PHP to create a simple RSS reader (using SimpleXML) to grab some feeds from another blog and display them on my Web page.

Now, as much as possible, I'd like to try to write valid HTML. So I'm assuming the way to do this is to view the page in the browser (I'm using NetBeans, so I click "Preview page"), copy the source (using View Source), and stick that in W3C's validator. When I do that, I get all sorts of validation errors (like "cannot generate system identifier for general entity" and "general entity "blogId" not defined and no default entity") coming from the RSS feed.

Am I following the right process for this? Should I just ignore all the errors that are flagged in the RSS feed?

Thanks.

+1  A: 

I generally use HTML Tidy to clean up the data from outside the system.

David Dorward
But that wouldn't work in this case, would it? I'm using HTML Tidy (just started using it so I don't know all its features) and although it's flagging all these errors and offering to clean them up, I don't see how that helps in the long run because my page will just grab another RSS feed which will trigger the same errors next time. (Such as not escaping ampersands.) I guess I would have to sanitize the code in some way before outputting it, as gahooa states below.
johnnyb10
I mean: integrate HTML Tidy into your page generation system.
David Dorward
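To make "integrate HTML Tidy into your page generation system" concrete, here is a minimal sketch that assumes PHP's tidy extension (ext-tidy) is installed. The helper name tidy_fragment() is made up for illustration, and the option set is just one plausible choice, not a recommended configuration.

```php
<?php
// Hypothetical helper: repair an untrusted HTML fragment before embedding it in a page.
// Assumes ext-tidy; returns null when the extension is missing so the caller can fall back.
function tidy_fragment(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null;
    }
    $tidy = new tidy();
    $tidy->parseString($html, [
        'show-body-only'  => true,  // emit only the fragment, not a whole document
        'output-xhtml'    => true,  // write well-formed XHTML
        'quote-ampersand' => true,  // write bare & characters as &amp;
    ], 'utf8');
    $tidy->cleanRepair();           // close unclosed tags, fix nesting
    return tidy_get_output($tidy);
}
```

The idea is to run each feed item's markup through a helper like this at generation time, so every new feed is cleaned automatically instead of by hand.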
+3  A: 

In this case, where you are dealing with an untrusted or uncontrolled feed, you have limited options for being safe.

Two that come to mind are:

  1. use something like strip_tags() to take all of the formatting out of the RSS feed content.
  2. use a library like HTMLPurifier to validate and sanitize the content before outputting.
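As a rough sketch of option 1, the snippet below reduces a feed fragment to escaped plain text using only built-in PHP functions. The helper name feed_text() and the sample markup are invented for illustration. HTML Purifier (option 2) is a separate install, so it is not shown here.

```php
<?php
// Hypothetical helper: reduce an untrusted feed fragment to safe plain text.
function feed_text(string $fragment): string
{
    // Remove all markup, keeping only the text content.
    $text = strip_tags($fragment);
    // Decode existing entities first so they are not escaped twice...
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // ...then escape everything for HTML output.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// The unescaped "&blogId" inside the link disappears along with the tag.
echo feed_text('<p>Tom &amp; Jerry <a href="/p?id=1&blogId=2">more</a></p>'), "\n";
// prints: Tom &amp; Jerry more
```

This throws away links and formatting entirely, which is the trade-off of option 1 compared with a sanitizing library.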

For performance, you should cache the output-ready content, FYI.

--

Regarding Caching

There are many ways to do this... If you are using a framework, chances are it already has a way to do it. Zend_Cache is a class provided by the Zend framework, for example.

If you have access to memcached, that is super easy; if you don't, there are plenty of other ways.

The general concept is to prepare the output, and then store it, ready to be outputted many times. That way, you do not incur the overhead of fetching and preparing the output if it is simply going to be the same every time.

Consider this code, which fetches and formats the RSS feed at most once every 5 minutes. Every other request is just a fast readfile() call.

# When called, will prepare the cache
function GenCache1()
{
    //Get RSS feed
    //Parse it
    //Purify it
    //Format your output
    $output = ''; // placeholder: the purified, formatted HTML built in the steps above
    file_put_contents('/tmp/cache1', $output);
}

# Check to see if the file is available
if(! file_exists('/tmp/cache1'))
{
    GenCache1();
}
else
{
    # If the file is older than 5 minutes (300 seconds), then regen
    $a = stat('/tmp/cache1');
    if($a['mtime'] + 300 < time())
       GenCache1();
}


# Now, simply use this code to output
readfile('/tmp/cache1');
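
The commented steps in GenCache1() could be filled in along these lines. This is only a sketch under assumptions: the feed is RSS 2.0, the XML has already been fetched into a string, and the helper names build_feed_html() and cached_output() are made up for illustration. It keeps only titles and links, and escapes both on output.

```php
<?php
// Hypothetical helper: turn RSS 2.0 XML into an escaped HTML list.
function build_feed_html(string $xml, int $limit = 5): string
{
    libxml_use_internal_errors(true);    // collect XML errors instead of emitting warnings
    $rss = simplexml_load_string($xml);
    if ($rss === false) {
        return '';                       // unparsable feed: output nothing
    }
    $html = "<ul>\n";
    $count = 0;
    foreach ($rss->channel->item as $item) {
        if ($count++ >= $limit) {
            break;
        }
        // Escape on output so bare ampersands in links become &amp;.
        $title = htmlspecialchars(strip_tags((string) $item->title), ENT_QUOTES, 'UTF-8');
        $link  = htmlspecialchars((string) $item->link, ENT_QUOTES, 'UTF-8');
        $html .= "  <li><a href=\"$link\">$title</a></li>\n";
    }
    return $html . "</ul>\n";
}

// Hypothetical helper: return cached output, rebuilding it when older than $ttl seconds.
function cached_output(string $file, int $ttl, callable $build): string
{
    if (!is_file($file) || filemtime($file) + $ttl < time()) {
        file_put_contents($file, $build(), LOCK_EX);
    }
    return file_get_contents($file);
}
```

A page could then call something like echo cached_output('/tmp/cache1', 300, $builder), where $builder is a closure that fetches the feed and passes it to build_feed_html().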
gahooa
Thanks, I thought something like that might be necessary. Also, can you explain a little more what you mean by caching the output-ready content, and let me know (in general terms) how I'd do that?
johnnyb10
A: 

RSS should always be XML compliant. So I suggest you use XHTML for your website. Since XHTML is also XML compliant you should not have any errors when validating an XHTML page with RSS.

EDIT: Of course, this only holds if the content you're getting is actually valid XML...

galaktor
RSS can include XML encoded tag soup. So, while it could be conforming XML, the data it carries might not be conforming HTML fragments.
David Dorward
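A small illustration of that last point, using a made-up feed item: the XML below is well formed, yet the HTML fragment it carries has an unclosed tag and an unescaped ampersand, which is the kind of thing that triggers the "general entity "blogId" not defined" error in the validator.

```php
<?php
// Made-up feed item: well-formed XML whose description carries escaped tag soup.
$xml = '<item><description>'
     . '&lt;p&gt;Tom&lt;br&gt;&lt;a href="?id=1&amp;blogId=2"&gt;more&lt;/a&gt;'
     . '</description></item>';

$item = simplexml_load_string($xml);     // parses fine: the XML itself is well formed
$fragment = (string) $item->description; // reading the text decodes the entities

echo $fragment, "\n";
// prints: <p>Tom<br><a href="?id=1&blogId=2">more</a>
```

Echoing $fragment straight into an XHTML page would produce invalid markup, so the fragment still has to be stripped, escaped, or repaired first.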