Forcing ending tags in HTML segments or ignoring missing ending tags


When creating an RSS feed that shows a subset of a larger HTML document (the first x characters), I've run into an issue where some tags open within those first x characters but their closing tags fall outside that range. If the consumer of the feed renders the HTML it contains, the unclosed tags can cause unexpected rendering problems in the page showing the feed.

I'm assuming this is a common problem that RSS feed writers and readers solved long ago, but I can't figure out how to handle it short of parsing the HTML in the feed and adding the missing end tags, which could get messy. Any suggestions would be greatly appreciated. Thanks in advance.

Chris

If you use PHP, an excellent solution is HTMLPurifier. It will clean up the markup and make it safe to retransmit.

DGM 2009-08-22 13:17:18

Thanks for the suggestion. Unfortunately this is for a .NET project, so it doesn't fit technology-wise, but I will remember it for other projects in PHP.

Chris Dellinger 2009-08-23 02:08:39

Not sure if this would work for your project, but I use HTML Tidy for this in FeedDemon.

Nick Bradbury 2009-08-22 14:46:08

Thanks. This sounds promising. I'll respond back after investigating further.

Chris Dellinger 2009-08-23 02:09:10

Where does the larger document come from? If there is source text from which the HTML is generated, it's much easier to truncate that and re-generate the HTML from the truncated version than it is to deal with the problems of handling partial HTML. To do this at all properly you'd basically need to be re-parsing and serialising the HTML all over again.

HTML inside RSS is still troublesome, anyhow. You might be better off stripping all the tags and doing a simple text truncate on what's left.

bobince 2009-08-22 21:43:39
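As a rough illustration of the "strip all the tags and do a simple text truncate" idea above, here is a minimal sketch. The thread is about .NET, so Python and its standard `html.parser` module are used here only for illustration; the names `TextExtractor` and `plain_text_summary` are made up for this example, and cutting at a word boundary with a trailing "..." is an assumption, not something the answer specifies.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text_summary(html, limit):
    """Strip all tags, collapse whitespace, and cut at a word boundary."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    # Join the text nodes and normalise runs of whitespace to single spaces.
    text = " ".join("".join(extractor.parts).split())
    if len(text) <= limit:
        return text
    # Back up to the last space inside the limit so no word is cut in half.
    cut = text[:limit].rsplit(" ", 1)[0]
    return cut + "..."


print(plain_text_summary("<p>Hello <b>big wide</b> world</p>", 12))
```

The result is plain text, so it still has to be escaped before being placed in the feed. Note also that this sketch keeps the contents of `<script>` and `<style>` elements as text, which a real implementation would probably want to drop.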

The larger document comes from user-entered text from a YUI Rich Text Editor. More often than not, this text will include HTML formatting.

Chris Dellinger 2009-08-23 02:10:41

That's unfortunate. Processing general HTML is very tricky to do correctly, especially when security matters. You'd probably have to get a full-blown HTML parser, turn the input into a DOM or similar object tree, then prune bits off before re-serialising.

bobince 2009-08-23 11:01:27

(This is essentially what Tidy or Purifier would be doing internally.)

bobince 2009-08-23 11:02:18
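To make the parse, prune and re-serialise idea a little more concrete, here is a simplified sketch that streams through the markup, stops once a text budget is used up, and then closes whatever tags are still open. It is again written in Python with the standard `html.parser` module purely for illustration (the project in question is .NET); `TruncatingParser`, `truncate_html` and the `VOID_TAGS` set are names invented for this example. It is not a sanitiser and does not claim to be what Tidy or Purifier do internally: comments are dropped, `<script>` and `<style>` content is not handled specially, and badly nested input is only repaired in the crudest way.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed list, taken from the HTML standard).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    """Re-emits markup until `limit` text characters have been written."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit   # text characters still allowed
        self.out = []            # serialised output pieces
        self.open_tags = []      # stack of elements awaiting a closing tag
        self.done = False        # True once the budget is exhausted

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        rendered = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return  # ignore stray closing tags
        # Close anything left open inside the element being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        if len(data) >= self.remaining:
            data = data[:self.remaining]
            self.done = True
        self.remaining -= len(data)
        self.out.append(escape(data, quote=False))


def truncate_html(html, limit):
    """Return at most `limit` characters of text with every opened tag closed."""
    parser = TruncatingParser(limit)
    parser.feed(html)
    parser.close()
    closing = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(truncate_html("<p>Hello <b>big wide</b> world</p>", 9))
```

The budget counts visible text characters rather than raw markup characters, which is a design choice of this sketch. For untrusted input such as rich text editor content, the output would still need to pass through a proper sanitiser before being published.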
