ansaurus

Question

Answer 1

+1 A:

I generally use HTML Tidy to clean up the data from outside the system.

David Dorward 2009-08-21 15:57:26

But that wouldn't work in this case, would it? I'm using HTML Tidy (just started using it so I don't know all its features) and although it's flagging all these errors and offering to clean them up, I don't see how that helps in the long run because my page will just grab another RSS feed which will trigger the same errors next time. (Such as not escaping ampersands.) I guess I would have to sanitize the code in some way before outputting it, as gahooa states below.

johnnyb10 2009-08-21 16:06:11

I mean: integrate HTML Tidy into your page generation system.
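
A minimal sketch of what that integration might look like, assuming PHP with the tidy extension available. The helper name tidy_fragment() and the chosen options are illustrative assumptions, not something stated in this thread. Note that Tidy repairs markup; it does not remove scripts or other unwanted content.

```php
<?php
// Hypothetical helper: repairs an HTML fragment with Tidy when the extension
// is loaded, and otherwise falls back to escaping the fragment as plain text.
function tidy_fragment(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        return htmlspecialchars($html, ENT_QUOTES, 'UTF-8');
    }
    $config = [
        'show-body-only' => true, // return only the fragment, no <html>/<body> wrapper
        'output-xhtml'   => true, // ask for well-formed XHTML output
    ];
    $clean = tidy_repair_string($html, $config, 'utf8');
    // tidy_repair_string() can return false on failure; escape in that case.
    return $clean === false
        ? htmlspecialchars($html, ENT_QUOTES, 'UTF-8')
        : $clean;
}
```

Each feed item would be passed through tidy_fragment() while the page is being generated, before it is echoed.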

David Dorward 2009-08-21 18:56:08

Answer 2

+3 A:

In this case, where you are dealing with an untrusted or uncontrolled feed, you have limited options for being safe.

Two that come to mind are:

use something like strip_tags() to take all of the formatting out of the RSS feed content.
use a library like HTMLPurifier to validate and sanitize the content before outputting.
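
As a rough sketch of the first option, assuming plain text output is acceptable: the function name sanitize_feed_text() is hypothetical, and only built-in PHP functions are used. It strips the tags, then decodes and re-encodes entities so that stray ampersands end up escaped exactly once.

```php
<?php
// Hypothetical helper: reduces untrusted feed content to escaped plain text.
function sanitize_feed_text(string $raw): string
{
    // Remove every HTML tag; the text between the tags is kept.
    $text = strip_tags($raw);
    // Decode any entities already present, so nothing gets double-escaped...
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // ...then escape &, <, > and quotes for safe output in an HTML page.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo sanitize_feed_text('<p>Fish & chips <b>today</b></p>');
// prints: Fish &amp; chips today
```

This throws away all formatting. If the feed's markup should survive, the second option is the better fit.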

For performance, you should cache the output-ready content, FYI.

--

Regarding Caching

There are many ways to do this... If you are using a framework, chances are it already has a way to do it. Zend_Cache is a class provided by the Zend framework, for example.

If you have access to memcached, then that is super easy. But if you don't then there are a lot of other ways.

The general concept is to prepare the output, and then store it, ready to be outputted many times. That way, you do not incur the overhead of fetching and preparing the output if it is simply going to be the same every time.

Consider this code, which fetches and formats the RSS feed at most once every 5 minutes. All the other requests are a fast readfile() call.

# When called, will prepare the cache
function GenCache1()
{
    //Get RSS feed
    //Parse it
    //Purify it
    //Format your output and store it in $output
    $output = ''; // placeholder: replace with the formatted, purified HTML
    file_put_contents('/tmp/cache1', $output);
}

# Check to see if the file is available
if(! file_exists('/tmp/cache1'))
{
    GenCache1();
}
else
{
    # If the file is older than 5 minutes (300 seconds), then regen
    $a = stat('/tmp/cache1');
    if($a['mtime'] + 300 < time())
       GenCache1();
}


# Now, simply use this code to output
readfile('/tmp/cache1');
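
A variation on the same idea, offered as a sketch rather than a drop-in replacement: the helper cached_output(), its parameters, and the cache file name are all assumptions made for illustration. It writes to a temporary file and renames it into place, so a concurrent request should not read a half-written cache file.

```php
<?php
// Hypothetical helper: returns the cached content, regenerating it when the
// cache file is missing or older than $ttl seconds.
function cached_output(string $path, int $ttl, callable $generate): string
{
    clearstatcache(true, $path);
    if (is_file($path) && filemtime($path) + $ttl >= time()) {
        $cached = file_get_contents($path);
        if ($cached !== false) {
            return $cached;
        }
    }
    $output = $generate();
    // Write to a temporary file first, then rename it over the cache file.
    $tmp = $path . '.' . uniqid('', true) . '.tmp';
    file_put_contents($tmp, $output);
    rename($tmp, $path);
    return $output;
}

echo cached_output(sys_get_temp_dir() . '/feed_cache_example', 300, function () {
    // Fetch, parse, purify and format the feed here.
    return '<ul><li>Example item</li></ul>';
});
```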

gahooa 2009-08-21 16:01:08

Thanks, I thought something like that might be necessary. Also, can you explain a little more what you mean by caching the output-ready content, and let me know (in general terms) how I'd do that?

johnnyb10 2009-08-21 16:07:36

Answer 3

A:

RSS should always be XML compliant. So I suggest you use XHTML for your website. Since XHTML is also XML compliant you should not have any errors when validating an XHTML page with RSS.

EDIT: Of course, this only counts if the content you're getting is actually valid XML...

galaktor 2009-08-21 16:09:55

RSS can include XML encoded tag soup. So, while it could be conforming XML, the data it carries might not be conforming HTML fragments.
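
To illustrate with a made-up feed (the XML below is invented for this example, and PHP's SimpleXML extension is assumed to be available): the document parses as well-formed XML, yet the decoded description is an HTML fragment with a bare ampersand and unclosed tags.

```php
<?php
// Invented example feed: well-formed XML whose description carries
// entity-encoded "tag soup".
$rss = '<rss version="2.0"><channel><item>'
     . '<description>&lt;p&gt;Fish &amp; chips &lt;b&gt;unclosed</description>'
     . '</item></channel></rss>';

$xml = simplexml_load_string($rss);   // parses without error
$fragment = (string) $xml->channel->item->description;

echo $fragment;
// prints: <p>Fish & chips <b>unclosed
```

Echoing $fragment straight into a page would produce the validation errors described in the question, which is why it needs to be cleaned first.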

David Dorward 2009-08-21 18:56:55

How to validate HTML with RSS?