ansaurus

Question

Distinguish between HTML/XHTML and plain text in an RSS description element

Answer 1

A:

Heh, my grandad used to read that newspaper :)

A very primitive approach to detecting HTML is to strip any tags out of the source (in PHP, you would do that with strip_tags()) and see whether the result differs from the original. Given the chaos that is RSS, you may have to run this twice, once before and once after html_entity_decode(), so that both entity-encoded and unencoded tags are detected reliably.
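
The two-pass check described above could be sketched in Java roughly as follows. This is only an illustration under some assumptions: the class and method names are made up, the regex is a crude stand-in for PHP's strip_tags() rather than an equivalent, and `decodeBasicEntities` is a hypothetical replacement for html_entity_decode() that handles just four entities.

```java
import java.util.regex.Pattern;

class TagStripCheck {
    // Anything of the form <...>. Deliberately crude; not equivalent to PHP's strip_tags().
    private static final Pattern TAG_LIKE = Pattern.compile("<[^>]*>");

    // Hypothetical stand-in for html_entity_decode(): only the entities
    // needed to reveal entity-encoded markup are handled.
    static String decodeBasicEntities(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
    }

    static String stripTags(String s) {
        return TAG_LIKE.matcher(s).replaceAll("");
    }

    // True if stripping changes the text, either as-is or after entity decoding.
    static boolean looksLikeHtml(String description) {
        if (!stripTags(description).equals(description)) {
            return true;
        }
        String decoded = decodeBasicEntities(description);
        return !stripTags(decoded).equals(decoded);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHtml("Plain text only"));                // false
        System.out.println(looksLikeHtml("<p>Hello</p>"));                   // true
        System.out.println(looksLikeHtml("&lt;p&gt;Hello&lt;/p&gt;"));       // true
        System.out.println(looksLikeHtml("Dem Mutigen geh<F6>rt die Urne")); // true (false positive)
    }
}
```

The last call shows the weakness of the approach: anything that merely looks like a tag is counted as HTML.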

Usually, that should produce halfway reliable results, but then I saw the ö in this:

   <title>Analyse: Dem Mutigen geh<F6>rt die Urne</title>

What kind of encoding is this? I've never seen it before. It would of course be (mis)interpreted as an HTML tag. Is this something Atom-specific?

Pekka 2010-03-07 16:17:41

strip_tags is primitive, and will eat everything that is even slightly tag-like (e.g. `1<2`). It doesn't look at entities. It's absolutely unsuitable for detecting HTML.

porneL 2010-03-07 17:13:21

Sorry, never seen the encoding of the "ö" like that before.

er4z0r 2010-03-07 17:48:26

@er4 what server-side languages (if any) can you use? Edit: Ah, I overlooked the `Java` tag. You could look at a Java HTML-stripping library; I still think that is the best way to go. Maybe combine it with a list of tags to strip, taken from the valid HTML 4 tag names. That would leave edge cases like the strangely encoded `ö` alone.
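
A minimal sketch of the tag-name list idea, assuming Java 9+ for `Set.of`. The class name, the method name and the tag list are illustrative only; the list is a small subset of HTML 4 element names, not the complete set.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KnownTagCheck {
    // Illustrative subset of HTML 4 element names; extend as needed.
    private static final Set<String> KNOWN_TAGS = Set.of(
            "a", "b", "blockquote", "br", "div", "em", "h1", "h2", "h3", "i", "img",
            "li", "ol", "p", "pre", "span", "strong", "table", "td", "tr", "ul");

    // "<" or "</", then a name starting with a letter, followed by whitespace, "/" or ">".
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)(?=[\\s/>])");

    // True only if at least one tag-like token carries a known element name.
    static boolean containsKnownTag(String text) {
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            if (KNOWN_TAGS.contains(m.group(1).toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsKnownTag("<p>Hello</p>"));                   // true
        System.out.println(containsKnownTag("Dem Mutigen geh<F6>rt die Urne")); // false
        System.out.println(containsKnownTag("1<2"));                            // false
    }
}
```

Because `F6` is not in the list and `<2` does not start with a letter, neither the oddly encoded `ö` nor `1<2` is treated as markup here.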

Pekka 2010-03-07 17:50:50

Thank you, Pekka. I am currently looking into NekoHTML and JTidy to see if they could do the trick.

er4z0r 2010-03-07 18:36:44

OK. I could not come up with a good solution using either of the two. So for now I am just looking for entities, and if I find any, it's got to be HTML ;-)
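
A rough sketch of what such an entity check might look like. The regex and the names are assumptions for illustration, not the code actually used; it matches named, decimal and hexadecimal character references.

```java
import java.util.regex.Pattern;

class EntityCheck {
    // Named (&amp;), decimal (&#246;) or hexadecimal (&#xF6;) character references.
    private static final Pattern ENTITY =
            Pattern.compile("&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);");

    static boolean containsEntity(String text) {
        return ENTITY.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsEntity("Fish &amp; chips"));  // true
        System.out.println(containsEntity("geh&#246;rt"));       // true
        System.out.println(containsEntity("AT&T and plain text")); // false
    }
}
```

The pattern does not check that a named reference is a real HTML entity, so something like `&foo;` in plain text would also be reported.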

er4z0r 2010-03-14 21:09:05
