Hi,

Long version:

Those familiar with the standardization nightmare of the RSS family may know that RSS does not tell you whether, for example, the "description" element contains plain text, HTML, or XHTML.

I currently use the ROME API to convert various RSS versions to Atom 1.0. ROME will happily parse the RSS and later output an Atom feed. Fortunately, Atom has a means of declaring whether a summary contains text, HTML, or XHTML.

Example. RSS:

<item>
  <link>http://www.schwarzwaelder-bote.de/wm?catId=79039&amp;amp;artId=14737088&amp;amp;rss=true</link>
  <title>Analyse: Winter reißt Löcher in Straßen und Kassen</title>
  <description>&lt;img src="http://www.schwarzwaelder-bote.de/cms_images/swol/dpa-InfoLine_rs-images/20100306/1192a_24128948.thumbnail.jpg" alt="Schlagloch" title="" border="0"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Berlin (dpa) - Von Schnee und Eis befreit sind Deutschlands Straßen, und jetzt geht es ans große Aufräumen....</description>
</item>

becomes Atom:

<entry>
  <title>Analyse: Winter reißt Löcher in Straßen und Kassen</title>
  <link rel="alternate" href="http://www.schwarzwaelder-bote.de/wm?catId=79039&amp;amp;artId=14737088&amp;amp;rss=true" />
  <author>
    <name />
  </author>
  <id>http://www.schwarzwaelder-bote.de/wm?catId=79039&amp;amp;artId=14737088&amp;amp;rss=true</id>
  <summary type="text">&lt;img src="http://www.schwarzwaelder-bote.de/cms_images/swol/dpa-InfoLine_rs-images/20100306/1192a_24128948.thumbnail.jpg" alt="Schlagloch" title="" border="0"&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;Berlin (dpa) - Von Schnee und Eis befreit sind Deutschlands Straßen, und jetzt geht es ans große Aufräumen....</summary>
</entry>

The problem is type="text", which tells feed readers like Firefox to render the content of the summary as plain text, so you get to see all the HTML source.

Short version: How do I detect that the content of the description element is (X)HTML so I can set the correct type attribute?

A: 

Heh, my grandad used to read that newspaper :)

A very primitive approach to detecting HTML would be to strip any tags out of the source (in PHP, you would do that with strip_tags()) and see whether the result differs from the original. Given the chaos that is RSS, you may have to run this twice, though: once before and once after an html_entity_decode(), so that both entity-encoded and unencoded tags get detected reliably.

Usually, that should produce halfway reliable results, but then I saw the ö in this:

   <title>Analyse: Dem Mutigen geh<F6>rt die Urne</title>

What kind of encoding method is this? I've never seen that before. That would of course be (mis)interpreted as an HTML tag. Is this something Atom-specific?
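
For what it's worth, the strip-and-compare idea could be sketched in Java roughly like this. It is only an illustration under assumptions: the class and method names (`StripAndCompare`, `stripTags`, `decodeBasicEntities`, `looksLikeHtml`) are made up for this example, the regex is a crude stand-in for PHP's strip_tags(), and only a handful of entities are decoded rather than everything html_entity_decode() handles.

```java
import java.util.regex.Pattern;

class StripAndCompare {
    // Crude stand-in for PHP's strip_tags(): removes anything shaped like <...>.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll("");
    }

    // Decodes only the few entities needed to uncover entity-encoded markup.
    static String decodeBasicEntities(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
    }

    // First pass on the raw string, second pass after entity decoding.
    static boolean looksLikeHtml(String s) {
        if (!stripTags(s).equals(s)) {
            return true;
        }
        String decoded = decodeBasicEntities(s);
        return !stripTags(decoded).equals(decoded);
    }
}
```

With this sketch, `StripAndCompare.looksLikeHtml("&lt;b&gt;bold&lt;/b&gt;")` returns true only on the second pass, while a plain sentence without angle brackets returns false. It also flags the `<F6>` title above as HTML, which is exactly the misinterpretation described.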

Pekka
strip_tags is primitive, and will eat everything that is even slightly tag-like (e.g. `1<2`). It doesn't look at entities. It's absolutely unsuitable for detecting HTML.
porneL
Sorry, never seen the encoding of the "ö" like that before.
er4z0r
@er4 what server-side languages (if any) can you use? Edit: Ah, I overlooked the `Java` tag. You could look at some Java HTML stripping library, I still think that is the best way to go. Maybe with a list of what tags to strip, taken from a list of valid HTML 4 tag names. That would leave edge cases like the strangely encoded `ö` alone.
Pekka
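
A rough Java sketch of the tag-list idea from the comment above, using only the standard library. The class name `KnownTagDetector`, the method `containsKnownTag`, and the regex are assumptions for illustration, and the tag set is a small sample, not the full list of HTML 4 element names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KnownTagDetector {
    // Deliberately incomplete sample of HTML 4 element names; extend as needed.
    private static final Set<String> KNOWN_TAGS = new HashSet<>(Arrays.asList(
            "a", "b", "blockquote", "br", "div", "em", "font", "h1", "h2", "h3",
            "i", "img", "li", "ol", "p", "pre", "span", "strong", "table", "td",
            "tr", "ul"));

    // Matches an opening or closing tag and captures its element name.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^<>]*>");

    // True if at least one tag-shaped token has a name from the known list.
    static boolean containsKnownTag(String s) {
        Matcher m = TAG.matcher(s);
        while (m.find()) {
            if (KNOWN_TAGS.contains(m.group(1).toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

Because `F6` is not in the list, a title like `geh<F6>rt` is left alone, and `1 < 2 and 3 > 2` never matches since the `<` is not followed by a letter. If the description is still entity-encoded, it would need decoding before this check.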
Thank you Pekka. I am currently looking into nekohtml and jtidy to see if they could do the trick.
er4z0r
OK. I could not come up with a good solution using either of the two. So for now I am just looking for entities, and if I find any, it's got to be HTML ;-)
er4z0r
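
The entity heuristic from the last comment might look something like this in Java. This is a guess at the approach, not the code actually used: `EntitySniffer`, `containsEntity`, and `guessType` are hypothetical names, and the regex simply accepts anything shaped like a named, decimal, or hexadecimal character reference. The returned strings are the Atom type values "html" and "text" mentioned in the question.

```java
import java.util.regex.Pattern;

class EntitySniffer {
    // Named (&nbsp;), decimal (&#228;) or hexadecimal (&#xE4;) character references.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    static boolean containsEntity(String s) {
        return ENTITY.matcher(s).find();
    }

    // Heuristic only: any entity-shaped token is taken as a sign of HTML.
    static String guessType(String s) {
        return containsEntity(s) ? "html" : "text";
    }
}
```

The result of `guessType` could then be used as the summary's type attribute. Text that merely contains something entity-shaped would be misclassified as HTML, and HTML without any entities would be missed, so this is a stopgap rather than a reliable detector.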