ansaurus

Question

Answer 1

A:

Hi buddy, simple as:

public static string StripTags2(string html)
{
    return html.Replace("<", "&lt;").Replace(">", "&gt;");
}

This escapes every "<" and ">" in the string. Is this what you want?

José Leal 2008-11-13 12:37:21

...ah. Well, now that the answer (along with the interpretation of the ambiguous question) has completely changed, I'll pick nits at the lack of encoding instead. ;-)

bobince 2008-11-13 12:50:31

I don't think it is a good idea to reinvent the wheel - especially when your wheel is square. You should use HtmlEncode instead.

Kramii 2008-11-13 15:28:23
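To make the point from the comments concrete, here is a minimal C# sketch that compares the hand-rolled replacement with a framework encoder. It assumes System.Net.WebUtility is available in your framework version; the names EncodeSketch, ManualEscape and FrameworkEscape are made up for illustration.

```csharp
using System;
using System.Net;

public static class EncodeSketch
{
    // Hand-rolled version from the answer above: escapes only < and >.
    public static string ManualEscape(string html)
    {
        return html.Replace("<", "&lt;").Replace(">", "&gt;");
    }

    // Framework encoder: also escapes & and quote characters.
    public static string FrameworkEscape(string html)
    {
        return WebUtility.HtmlEncode(html);
    }
}
```

For the input "a & b <c>", ManualEscape leaves the ampersand untouched, while FrameworkEscape turns it into &amp;amp;. That matters as soon as the text itself contains something like "&lt;".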

Answer 2

A:

You may want to use SgmlReader.

http://code.msdn.microsoft.com/SgmlReader

Leonardo Herrera 2008-11-13 12:40:30

Answer 3

+2 A:

If you have data that has HTML tags and you want to display it so that a person can SEE the tags, use HttpServerUtility.HtmlEncode.

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

If you have data that has HTML tags and you want to strip out the tags and just display the unformatted text, use a regular expression.

Corey Trager 2008-11-13 12:41:31

In PHP there is a function called strip_tags(); maybe you have something similar.

tharkun 2008-11-13 22:46:41

Answer 4

+4 A:

If you are talking about tag stripping, it is relatively straightforward if you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

<[^>]*>
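In C# that pattern could be applied roughly like this. This is only a sketch: the class and method names (TagStripper, StripTags) are hypothetical, and it has all the limitations discussed in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches anything that looks like a tag: "<", any run of non-">" characters, ">".
    private static readonly Regex AnyTag = new Regex("<[^>]*>", RegexOptions.Compiled);

    // Removes every match; the text between tags is left untouched.
    public static string StripTags(string html)
    {
        return AnyTag.Replace(html, string.Empty);
    }
}
```

Note that nothing is inserted where a tag used to be, so "hello<br />world" becomes "helloworld"; entities such as &amp;amp; are also left encoded.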

If you do have to worry about <script> tags and the like, then you'll need something a bit more powerful than regular expressions because you need to track state, something more like a Context Free Grammar (CFG). Although you might be able to accomplish it with 'Left To Right' or non-greedy matching.
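A rough sketch of the non-greedy idea, assuming it is acceptable to drop <script> and <style> elements together with their contents before stripping the remaining tags. The names ScriptAwareStripper and StripTags are hypothetical, and this is still not a real parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScriptAwareStripper
{
    // Non-greedy match of a whole <script> or <style> element, contents included.
    // Singleline lets "." span line breaks; \1 refers back to the opening tag name.
    private static readonly Regex ScriptOrStyle = new Regex(
        @"<(script|style)\b[^>]*>.*?</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    private static readonly Regex AnyTag = new Regex("<[^>]*>");

    public static string StripTags(string html)
    {
        // First pass: remove script/style blocks so their code never reaches the output.
        string withoutBlocks = ScriptOrStyle.Replace(html, string.Empty);
        // Second pass: remove the remaining tags.
        return AnyTag.Replace(withoutBlocks, string.Empty);
    }
}
```

Doing it in two passes keeps a stray "<" inside the script (as in "a < b") from confusing the simple tag pattern.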

If you can use regular expressions there are many web pages out there with good info:

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool; unfortunately, I don't know of a good one to recommend.

vfilby 2008-11-13 12:44:51

You also have to worry about > in attribute values, comments, PIs/CDATA in XML and various common malformednesses in legacy HTML. In general [X][HT]ML is not amenable to parsing with regexps.

bobince 2008-11-13 12:58:00

You can accommodate the > in attribute values by making attributes a part of the regular expression. It is only the complexity of nested tags that limits the usefulness of parsing with regular expressions.

vfilby 2008-11-16 15:33:30
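One way to read that suggestion, sketched in C#: let the pattern skip over quoted attribute values, so a ">" inside quotes does not end the tag. The names QuoteAwareStripper and StripTags are hypothetical, and the sketch assumes quotes are balanced; comments, CDATA and malformed markup are still not handled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteAwareStripper
{
    // Inside a tag, accept either a character that is not ">" or a quote,
    // or a complete double-quoted or single-quoted attribute value.
    private static readonly Regex TagWithQuotedValues = new Regex(
        @"<(?:[^>""']|""[^""]*""|'[^']*')*>");

    public static string StripTags(string html)
    {
        return TagWithQuotedValues.Replace(html, string.Empty);
    }
}
```

A tag with an unterminated quote does not match this pattern at all and is left in the output.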

don't you mean <[^>]*> which matches things like <html>, and not <[^>]>* which matches things like <h>>>> ?

Greg 2009-06-30 13:16:35

You are correct sir, typo fixed.

vfilby 2009-07-13 19:15:38

Answer 5

A:

Depends on what you mean by "html." The most complex case would be complete web pages. That's also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

mpez0 2008-11-13 12:46:54

Answer 6

+7 A:

HttpUtility.HtmlEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN Documentation:

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and > are encoded as &lt; and &gt; for HTTP transmission.

HttpUtility.HtmlEncode() method, detailed here:

public static void HtmlEncode(
    string s,
    TextWriter output
)

Usage:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
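Outside an ASP.NET page there is no Server object to call. A minimal sketch of the same round trip, assuming System.Net.WebUtility is available in your framework version (the wrapper names EncodeRoundTrip, Encode and Decode are hypothetical):

```csharp
using System;
using System.Net;

public static class EncodeRoundTrip
{
    // Returns the encoded text directly instead of writing to a TextWriter.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    // Reverses the encoding, turning entities back into characters.
    public static string Decode(string text)
    {
        return WebUtility.HtmlDecode(text);
    }
}
```

Encoding "This is a <Test String>." gives "This is a &amp;lt;Test String&amp;gt;." as the literal characters in the string, and decoding that returns the original text.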

I hope that helps.

George Stocker 2008-11-13 13:42:10

A really good answer, George, thanks. It also highlighted how poorly I asked the question the first time around. Sorry.

Stuart Helwig 2008-11-14 00:38:51

Answer 7

+9 A:

The free and open source HtmlAgilityPack has a method:

var plainText = ConvertToPlainText(html);

Feed it an HTML string like

<b>hello world!</b><br /><i>it is me! !</i>

And you'll get a plain text result like:

hello world!
it is me!

Judah Himango 2009-07-13 19:17:44

I have used HtmlAgilityPack before but I can't see any reference to ConvertToPlainText. Are you able to tell me where I can find it?

horatio 2010-01-08 03:43:34

Horatio, it is included in one of the samples that comes with HtmlAgilityPack: http://htmlagilitypack.codeplex.com/sourcecontrol/changeset/view/62772?projectName=htmlagilitypack#52179

Judah Himango 2010-01-08 15:37:00

Thanks for that

horatio 2010-01-13 05:27:06

Thank you for this! :)

xraminx 2010-06-11 15:24:27


How do you convert Html to plain text?
