Bit of a random one: I want to have a play with some NLP stuff, and I would like to:

Get all the text that will be displayed to the user in a browser from HTML.

My ideal output would have no tags in it, only the text with its punctuation (full stops and anything else used) and newline characters, though I can tolerate a fairly reasonable amount of failure in this (random other stuff ending up in the output).

If there were a way of inserting a newline or full stop in places where the content is unlikely to continue, that would be an added bonus, e.g.:

Items in a ul or option tag could be separated by full stops (or, to be honest, just ignored).

I am working in Java, but would be interested in seeing any code that does this.

I can (and will if required) come up with something to do this, just wondered if there was anything out there like this already, as it would probably be better than what I come up with in an afternoon ;-).

An example of what I might write if I do end up doing this: use a SAX parser to find content in p tags, strip any span, strong, etc. tags from it, and add a full stop if I hit a div or another p without having seen a full stop.

Any pointers or suggestions very welcome.

A: 

I would just strip out all the <> tags, and if you want a full stop at the end of every sentence, check for closing tags and place a full stop there.

If you have

<strong> test </strong>

(or other tags that only change the look of the text), you could add a condition so that no full stop is placed there.
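A rough sketch of that idea in Java, using only regular expressions. The class name `NaiveTagStripper` and the list of "block" tags are made up for illustration, and this will trip over comments, script contents, and `>` characters inside attribute values, so treat it as a quick hack rather than a real HTML parser.

```java
import java.util.regex.Pattern;

class NaiveTagStripper {
    // Closing tags that usually end a run of text (illustrative, incomplete list).
    private static final Pattern BLOCK_END = Pattern.compile(
            "(?i)</(p|div|li|option|h[1-6]|td|tr|ul|ol)\\s*>");
    // Anything that looks like a tag.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String strip(String html) {
        // Collapse source whitespace so only block ends produce line breaks.
        String flat = html.replaceAll("\\s+", " ");
        // Turn block-level closing tags into line breaks.
        String marked = BLOCK_END.matcher(flat).replaceAll("\n");
        // Drop every remaining tag; inline tags such as <strong> vanish silently.
        String text = ANY_TAG.matcher(marked).replaceAll("");

        StringBuilder out = new StringBuilder();
        for (String line : text.split("\n")) {
            String s = line.trim();
            if (s.isEmpty()) {
                continue;
            }
            out.append(s);
            // Add a full stop only if the line does not already end a sentence.
            char last = s.charAt(s.length() - 1);
            if (last != '.' && last != '!' && last != '?') {
                out.append('.');
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

Calling `NaiveTagStripper.strip(html)` returns one line per block, each ending in sentence punctuation. Inline tags like strong are simply removed, so they never trigger a full stop.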

Sleepy Rhino
+2  A: 

Hmmm ... almost any HTML parser could be used to create the effect you want -- just run through all of the tags, emit only the text content, and emit a LF at the closing tag of every block element. As you say, a SAX implementation would be simple and straightforward.
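Something along these lines, as a minimal sketch with the JDK's built-in SAX parser. It assumes the input is already well-formed XHTML (real-world HTML would need tidying first), and the class name `BlockTextHandler` and its set of block element names are placeholders picked for the example, not a complete list.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BlockTextHandler extends DefaultHandler {
    // Illustrative, incomplete set of block-level element names.
    private static final Set<String> BLOCKS = Set.of(
            "p", "div", "li", "ul", "ol", "h1", "h2", "h3", "tr", "option");

    private final StringBuilder out = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Emit text content as-is; tags themselves are never written.
        out.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Emit one LF when a block element closes, without stacking blank lines.
        if (BLOCKS.contains(qName.toLowerCase(Locale.ROOT))
                && out.length() > 0
                && out.charAt(out.length() - 1) != '\n') {
            out.append('\n');
        }
    }

    static String extract(String xhtml) {
        BlockTextHandler handler = new BlockTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Assumption: callers pass well-formed XML; anything else is rejected.
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
        return handler.out.toString();
    }
}
```

`BlockTextHandler.extract(xhtml)` gives back the text with one line per block element. Inline elements such as span or strong fall through untouched because they are not in the set.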

Craig Trader
Thanks, I will get on with coding it I guess ;-). I was sort of thinking there might be something clever out there, possibly something used as part of accessibility, that might have a better system for deciding what the user should be reading / seeing.
gordatron
A: 

HTML parsers seem to be a reasonable starting point for this.

There are a number of them; for example, HtmlCleaner and NekoHTML seem to work fine.

They are good because they fix up the tags, which lets you process them more consistently, even if you are just removing them.

But as it turns out, you probably want to get rid of script tags, metadata, etc. In that case you are better off working with well-formed XML, which these tools produce for you from "wild" HTML.
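To give an idea of why well-formed XML helps, here is a small sketch that drops the non-visible parts with the JDK's DOM parser. It assumes the cleaner has already produced well-formed XML with lowercase element names; the class name `VisibleTextOnly` and the list of elements to drop are just picked for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class VisibleTextOnly {
    // Elements whose content is never shown as page text (illustrative list).
    private static final String[] HIDDEN = {"script", "style", "head"};

    static String extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            for (String name : HIDDEN) {
                NodeList nodes = doc.getElementsByTagName(name);
                // The NodeList is live, so walk it backwards while removing.
                for (int i = nodes.getLength() - 1; i >= 0; i--) {
                    Node n = nodes.item(i);
                    n.getParentNode().removeChild(n);
                }
            }
            // Concatenate the remaining text and collapse runs of whitespace.
            return doc.getDocumentElement().getTextContent()
                    .trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            // Assumption: input is well-formed XML; parser and I/O errors are rethrown.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }
}
```

`VisibleTextOnly.extract(xhtml)` returns the remaining text on a single line. Note that `getTextContent()` inserts no separators of its own, so adjacent blocks only stay apart if there was whitespace between them in the source.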

There are many SO questions relating to this (like this one); you should search for "HTML parsing" though ;-)

gordatron