When creating an RSS feed that shows a subset of a larger HTML document (the first x characters), I've run into an issue where some tags open within the first x characters but their closing tags fall outside that range. If the consumer of the feed renders the feed's HTML, the unclosed tags can cause unexpected rendering problems in the page showing the feed.

I assume this is a common problem that RSS feed writers and readers solved long ago, but I can't figure out how to handle it short of parsing the HTML in the feed and adding the missing end tags, which could get messy. Any suggestions would be greatly appreciated. Thanks in advance.

Chris

A: 

If you use PHP, an excellent solution is HTMLPurifier. It will clean up the markup and make it safe to retransmit.

DGM
Thanks for the suggestion. Unfortunately this is for a .NET project, so technology-wise it doesn't fit, but I will remember it for future PHP projects.
Chris Dellinger
A: 

Not sure if this would work for your project, but I use HTML Tidy for this in FeedDemon.

Nick Bradbury
Thanks. This sounds promising. I'll respond back after investigating further.
Chris Dellinger
A: 

Where does the larger document come from? If there is source text from which the HTML is generated, it's much easier to truncate that and re-generate the HTML from the truncated version than it is to deal with the problems of handling partial HTML. To do this at all properly you'd basically need to be re-parsing and serialising the HTML all over again.
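
If there is no source text to fall back on, the re-parse-and-serialise approach could look roughly like the sketch below. It is written in Python purely for illustration (the question is about .NET), and `TruncatingParser`, `truncate_html`, and `VOID_TAGS` are hypothetical names, not part of any library. It counts only text characters towards the limit, closes whatever elements are still open at the cut-off, and drops comments. It does no sanitising, so it is not a substitute for a real cleaner when the input is untrusted.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    """Re-serialises HTML, stopping after `limit` text characters."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.open_tags = []   # stack of currently open elements
        self.out = []         # output fragments
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        rendered = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close any elements left open inside the one being closed.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        piece = data[: self.remaining]
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if self.remaining <= 0:
            self.done = True

    def result(self):
        # Close whatever is still open, innermost first.
        closing = "".join(f"</{tag}>" for tag in reversed(self.open_tags))
        return "".join(self.out) + closing


def truncate_html(html, limit):
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()
    return parser.result()
```

For example, `truncate_html("<p>Hello <b>bold world</b> and more</p>", 10)` keeps ten characters of text and closes both open elements, giving `<p>Hello <b>bold</b></p>`.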

HTML inside RSS is still troublesome, anyhow. You might be better off stripping all the tags and doing a simple text truncate on what's left.
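
A minimal sketch of the strip-and-truncate option, again in Python for illustration only. `TextExtractor` and `plain_text_summary` are hypothetical names, and the trailing "..." and the cut at a word boundary are my own choices rather than anything the feed format requires. Note that the parser also passes the contents of script and style elements through as text, so those would need filtering if they can appear in the input.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, ignoring all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text_summary(html, limit):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse runs of whitespace left behind by removed markup.
    text = " ".join("".join(extractor.parts).split())
    if len(text) <= limit:
        return text
    # Cut at the last space before the limit so words are not split.
    cut = text[:limit].rsplit(" ", 1)[0]
    return cut + "..."
```

The result is plain text with entities already decoded, so it still has to be XML-escaped when it is written into the feed.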

bobince
The larger document comes from user-entered text from a YUI Rich Text Editor. More often than not, this text will include HTML formatting.
Chris Dellinger
That's unfortunate. Processing general HTML is very tricky to do correctly, especially when security matters. You'd probably have to get a full-blown HTML parser, turn the input into a DOM or similar object tree, then prune bits off before re-serialising.
bobince
(This is essentially what Tidy or Purifier would be doing internally.)
bobince