ansaurus

Question

Getting the text that will be displayed to the user from HTML

Answer 1

A:

I would just strip out everything that is inside <> tags. If you want a full stop at the end of every sentence, check for closing tags and place a full stop there.

If you have

<strong> test </strong>

(and other tags that only change the look of the text), you could add conditions so that no full stop is placed there.
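A rough sketch of that idea, in Python with regular expressions. The function name strip_tags and the list of "look only" tags are my own assumptions for illustration, and a regex like this will trip over comments, script contents and a ">" inside an attribute value, so treat it as a quick hack rather than a robust solution:

```python
import re

# Tags assumed to only change the look of text; no full stop is added after them.
INLINE_TAGS = {"strong", "b", "em", "i", "u", "span", "a"}

# Matches an opening or closing tag; captures "/" (if closing) and the tag name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def strip_tags(html):
    """Remove tags, placing a full stop where a non-inline tag closes."""

    def replace(match):
        is_closing = match.group(1) == "/"
        name = match.group(2).lower()
        if is_closing and name not in INLINE_TAGS:
            return ". "
        return ""

    text = TAG_RE.sub(replace, html)
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    text = re.sub(r"\s+\.", ".", text)   # no space before a full stop
    text = re.sub(r"\.{2,}", ".", text)  # nested closing tags give "..", keep one
    return text.strip()


print(strip_tags("<p>Hello <strong> test </strong> world</p><p>Bye</p>"))
# prints: Hello test world. Bye.
```

Note that the last substitution also shortens a literal "..." in the text to a single full stop, which may or may not matter for your use.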

Sleepy Rhino 2010-06-13 10:12:51

Answer 2

+2 A:

Hmmm ... almost any HTML parser could be used to create the effect you want: just run through all of the tags, emit only the text elements, and emit an LF for the closing tag of every block element. As you say, a SAX implementation would be simple and straightforward.
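For illustration, here is roughly what that could look like with an event-driven (SAX-style) parser. This sketch uses Python's standard html.parser rather than a Java SAX implementation; the names TextExtractor and extract_text and the BLOCK_TAGS set are assumptions made for the example, and the set is not a complete list of block elements:

```python
from html.parser import HTMLParser

# Assumed subset of block-level elements; extend as needed.
BLOCK_TAGS = {
    "p", "div", "li", "ul", "ol", "tr", "table", "blockquote", "pre",
    "h1", "h2", "h3", "h4", "h5", "h6",
}


class TextExtractor(HTMLParser):
    """Collects text nodes and emits a line feed when a block element closes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":  # an explicit line break also ends the line
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Normalise whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)


print(extract_text("<h1>Title</h1><p>Hello <strong>bold</strong> world</p>"))
```

Inline elements such as strong produce no line feed, so their text stays on the same line as the surrounding sentence.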

Craig Trader 2010-06-13 11:14:46

Thanks, I will get on with coding it, I guess ;-). I was sort of thinking there might be something clever out there, possibly something used as part of accessibility, that might have a better system for deciding what the user should be reading or seeing.

gordatron 2010-06-13 20:27:43

Answer 3

A:

HTML parsers seem to be a reasonable starting point for this.

There are a number of them; for example, HtmlCleaner and NekoHTML seem to work fine.

They are good because they fix up the tags so that you can process them more consistently, even if you are just removing them.

But as it turns out, you probably also want to get rid of script tags, metadata, etc. In that case you are better off working with well-formed XML, which these libraries produce for you from "wild" HTML.

There are many SO questions relating to this (like this one); you should search for "HTML parsing", though ;-)

gordatron 2010-07-18 20:26:00
