Hi,

I have a string which might include <br> or <span>...</span> tags or other HTML characters/entities. I want a robust way of stripping all that and getting the remaining UTF-8 characters. This should be cross-platform, ideally.

Something like this would be ideal:

http://snipplr.com/view/15261/python-decode-and-strip-html-entites-to-unicode/

but that also removes the tags.

Thanks!

A: 

Do you want to simply delete the elements, or to convert HTML to plain text?

Option 1:

If you just want to delete all occurrences of <br> and <span>, you can use a regex search and replace.
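A minimal sketch of that search and replace in Python, assuming only br and span tags need to go and that no attribute value contains a ">" character. The names TAG_RE and delete_br_and_span are made up for illustration.

```python
import re

# Matches <br>, <br/>, <br />, <span ...> and </span>, case-insensitively.
# Any other tag is left alone. Assumption: no ">" inside attribute values.
TAG_RE = re.compile(r"</?(?:br|span)\b[^>]*>", re.IGNORECASE)

def delete_br_and_span(text):
    """Delete br and span tags, keeping the text between them."""
    return TAG_RE.sub("", text)

print(delete_br_and_span("a<br>b<br />c<span class='x'>d</span>"))
```

This only deletes the tags themselves. Entities such as &amp; are left untouched.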

Option 2:

If what you're really trying to do is take a page that has formatting and convert it to plain text, the simplest and most robust way I can think of is to use a browser, or some browser engine, to actually parse the HTML and extract the text from it.

IOW, this is equivalent to copying a web page from the browser into the clipboard and then pasting it into notepad.
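If driving a real browser engine is too heavy, a lighter stand-in is to let an HTML parser do the tokenizing. The sketch below uses Python's standard html.parser module rather than a browser, so it does no layout: a br produces no line break, and text inside script or style elements is kept. TextExtractor and html_to_text are illustrative names.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before
        # the text is handed to handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(html_to_text("<p>Fish &amp; chips<br>caf&eacute;</p>"))
```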

Assaf Lavie
+1  A: 

Just how stringent are your requirements? A simple two-state FSA ought to do. Start in the READCHAR state. Whenever you read a '<' in that state, transition to the READTAG state; otherwise, write the character to your result string. Whenever you're in the READTAG state and read a '>', transition back to the READCHAR state.

Edit: Oops. Missed the part about entities. You'll need a READENTITY state for that too. When you transition out of it, you could also convert the code into the corresponding UTF-8 character.
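Here is a rough Python sketch of that three-state machine. It assumes reasonably well-formed input with no '>' inside tags, and it leans on the standard library's html.unescape to decode each entity. The function name strip_html and the fallback for a bare '&' are my own additions.

```python
import html

READCHAR, READTAG, READENTITY = range(3)

def strip_html(text):
    state = READCHAR
    out = []
    entity = ""
    i = 0
    while i < len(text):
        ch = text[i]
        if state == READCHAR:
            if ch == "<":
                state = READTAG
            elif ch == "&":
                entity = "&"
                state = READENTITY
            else:
                out.append(ch)
        elif state == READTAG:
            if ch == ">":
                state = READCHAR
        else:  # READENTITY
            if ch == ";":
                # Leaving the state: decode e.g. "&amp;" to "&".
                out.append(html.unescape(entity + ";"))
                state = READCHAR
            elif ch.isalnum() or ch == "#":
                entity += ch
            else:
                # Not an entity after all (e.g. "AT&T "): emit the buffer
                # verbatim and re-read ch in the READCHAR state.
                out.append(entity)
                state = READCHAR
                continue
        i += 1
    if state == READENTITY:
        out.append(entity)  # input ended in the middle of an entity
    return "".join(out)

print(strip_html('a <br> b &amp; c &quot;d&quot; &#233;'))
```

The result is a Python str, so it still has to be encoded if UTF-8 bytes are needed, for example with strip_html(s).encode("utf-8").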

Peter Ruderman
To note, more states are required, because attributes may contain ">".
strager
That's true, which is why I asked how stringent his requirements are. A '>' in a tag is fairly unlikely but certainly could happen. Similarly, the algorithm will need to be more complex if you have to deal with potentially malformed HTML or take special actions for certain tags.
Peter Ruderman
The OP states "robust" which probably means "works as a human would expect, assuming they fully understand the standard, in all cases". So ">" in an attribute would likely need to be handled.
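As a sketch of what the extra state could look like: track whether the scanner is inside a quoted attribute value, and only treat ">" as the end of the tag when it is not. This assumes attribute values are properly quoted. It does not decode entities, and it does not treat comments or script content specially. The function name is hypothetical.

```python
def strip_tags_quote_aware(text):
    out = []
    in_tag = False
    quote = None  # quote character that opened the current attribute value
    for ch in text:
        if not in_tag:
            if ch == "<":
                in_tag = True
            else:
                out.append(ch)
        elif quote is not None:
            # Inside a quoted value: everything, including ">", is skipped
            # until the matching quote closes it.
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ">":
            in_tag = False
    return "".join(out)

print(strip_tags_quote_aware('<a title="1 > 0">link</a> text'))
```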
strager
A: 

I'm not clear on what you want.

Input: This is a string <br> <br /> of text &amp; on many lines &quot;

Should this output:

1) This is a string <br> <br /> of text & on many lines "   (Replace &amp; with & and &quot; with ") 
2) This is a string of text & on many lines "
chocojosh
He wants the opposite. See the question: "remove all HTML formatting".
strager
"Something like this would be ideal: http://snipplr.com/view/15261/python-decode-and-strip-html-entites-to-unicode/ but that also removes the tags." The word "but" makes me think he does not want to remove the tags.
chocojosh