ansaurus

Question

C++: Remove all HTML formatting from string?

Answer 1

A:

Do you want to simply delete the elements, or to convert HTML to plain text?

Option 1:

If you just want to delete all occurrences of opening and closing tags, you can use a regex search and replace.
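
As a rough illustration of the regex approach, here is a minimal sketch using `std::regex` from C++11. The function name `strip_tags_regex` and the pattern `<[^>]*>` are my own assumptions, not something from the question. It only deletes things that look like tags; it does not decode entities, and it will be fooled by a '>' inside an attribute value, by comments, and by script blocks.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (name and pattern are illustrative assumptions):
// deletes every span that starts with '<', runs over non-'>' characters,
// and ends with '>'.
std::string strip_tags_regex(const std::string& html) {
    static const std::regex tag_pattern("<[^>]*>");
    return std::regex_replace(html, tag_pattern, "");
}
```

Calling `strip_tags_regex("<b>bold</b> text")` should give back `bold text`.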

Option 2:

If what you're really trying to do is take a page that has formatting and convert it to plain text, the simplest and most robust way I can think of is to use a browser, or some browser engine, to actually parse the HTML and extract the text from it.

IOW, this is equivalent to copying a web page from the browser into the clipboard and then pasting it into notepad.

Assaf Lavie 2009-06-11 02:55:24

Answer 2

+1 A:

Just how stringent are your requirements? A simple two-state FSA ought to do. Start in the READCHAR state. Whenever you read a '<' in that state, transition to the READTAG state; otherwise, write the character to your result string. Whenever you're in the READTAG state and read a '>', transition back to the READCHAR state.

Edit: Oops. Missed the part about entities. You'll need a READENTITY state for that too. When you transition out of it, you could also convert the code into the corresponding UTF-8 character.
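
To make the state machine concrete, here is a minimal sketch of the three states described above. The function name `strip_html` and the tiny entity table are assumptions for illustration only; a real table would be much larger. Numeric entities such as `&#65;` are not converted to UTF-8 here and are simply copied through unchanged. A bare '&' that turns out not to start an entity is written out literally.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper illustrating the READCHAR / READTAG / READENTITY idea.
std::string strip_html(const std::string& html) {
    enum class State { ReadChar, ReadTag, ReadEntity };
    // Deliberately tiny entity table (an assumption, not a complete list).
    static const std::unordered_map<std::string, std::string> entities = {
        {"amp", "&"}, {"lt", "<"}, {"gt", ">"}, {"quot", "\""}, {"apos", "'"}};

    State state = State::ReadChar;
    std::string result;
    std::string entity;  // characters collected between '&' and ';'

    for (std::size_t i = 0; i < html.size(); ++i) {
        const char c = html[i];
        switch (state) {
        case State::ReadChar:
            if (c == '<') {
                state = State::ReadTag;
            } else if (c == '&') {
                entity.clear();
                state = State::ReadEntity;
            } else {
                result += c;
            }
            break;
        case State::ReadTag:
            // Everything inside the tag is dropped.
            if (c == '>') state = State::ReadChar;
            break;
        case State::ReadEntity:
            if (c == ';') {
                auto it = entities.find(entity);
                if (it != entities.end()) {
                    result += it->second;
                } else {
                    result += "&" + entity + ";";  // unknown: copy through
                }
                state = State::ReadChar;
            } else if (std::isalnum(static_cast<unsigned char>(c)) || c == '#') {
                entity += c;
            } else {
                // Not an entity after all (e.g. a bare '&'): emit it
                // literally and reprocess this character as normal text.
                result += "&" + entity;
                state = State::ReadChar;
                --i;
            }
            break;
        }
    }
    // An unterminated entity at end of input is emitted as literal text.
    if (state == State::ReadEntity) result += "&" + entity;
    return result;
}
```

With this sketch, `strip_html("<p>Hello, <b>world</b>!</p>")` should return `Hello, world!`.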

Peter Ruderman 2009-06-11 02:55:33

To note, more states are required, because attributes may contain ">".
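
For example, one extra state for quoted attribute values is enough to keep a '>' inside quotes from ending the tag. This is only a sketch under that assumption; the name `strip_tags_quoted` is made up for illustration, and it still ignores comments, script blocks, and malformed markup.

```cpp
#include <string>

// Hypothetical variant of the tag-skipping state machine that tracks
// quoted attribute values inside a tag.
std::string strip_tags_quoted(const std::string& html) {
    enum class State { ReadChar, ReadTag, InQuote };
    State state = State::ReadChar;
    char quote = '\0';  // which quote character opened the current value
    std::string result;

    for (char c : html) {
        switch (state) {
        case State::ReadChar:
            if (c == '<') {
                state = State::ReadTag;
            } else {
                result += c;
            }
            break;
        case State::ReadTag:
            if (c == '>') {
                state = State::ReadChar;
            } else if (c == '"' || c == '\'') {
                quote = c;
                state = State::InQuote;
            }
            break;
        case State::InQuote:
            // Only the matching quote character closes the value.
            if (c == quote) state = State::ReadTag;
            break;
        }
    }
    return result;
}
```

Here `strip_tags_quoted("<a title=\"a > b\">link</a>")` should return `link`, where a plain two-state machine would leave ` b">link` behind.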

strager 2009-06-11 03:02:05

That's true, which is why I asked how stringent his requirements are. A '>' in a tag is fairly unlikely but certainly could happen. Similarly, the algorithm will need to be more complex if you have to deal with potentially malformed HTML or take special actions for certain tags.

Peter Ruderman 2009-06-11 12:48:09

The OP states "robust" which probably means "works as a human would expect, assuming they fully understand the standard, in all cases". So ">" in an attribute would likely need to be handled.

strager 2009-06-11 18:49:27

Answer 3

A:

I'm not clear on what you want.

Input: This is a string <br> <br /> of text &amp; on many lines &quot;

Should this output:

1) This is a string <br> <br /> of text & on many lines "   (Replace &amp; with & and &quot; with ") 
2) This is a string of text & on many lines "

chocojosh 2009-06-11 03:15:13

He wants the opposite. See the question: "remove all HTML formatting".

strager 2009-06-11 03:21:16

"Something like this would be ideal: http://snipplr.com/view/15261/python-decode-and-strip-html-entites-to-unicode/ but that also removes the tags." The word "but" makes me think he does not want to remove the tags.

chocojosh 2009-06-11 12:32:23
