views:

2639

answers:

7
A: 

Hi buddy, simple as:

public static string StripTags2(string html)
{
    return html.Replace("<", "&lt;").Replace(">", "&gt;");
}

This escapes every "<" and ">" in the string. Is this what you want?
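Note that replacing only "<" and ">" leaves "&" and quote characters untouched. A minimal sketch of a fuller escape, assuming .NET 4.0 or later where System.Net.WebUtility is available; the class and method names (HtmlEscaper, Escape) are made up for illustration:

```csharp
using System.Net;

public static class HtmlEscaper
{
    // Hypothetical helper: delegates to the framework encoder, which also
    // escapes '&' and quote characters in addition to '<' and '>'.
    public static string Escape(string html)
    {
        return WebUtility.HtmlEncode(html);
    }
}
```

For example, HtmlEscaper.Escape("<b>Tom & Jerry</b>") should give "&lt;b&gt;Tom &amp; Jerry&lt;/b&gt;".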

José Leal
...ah. Well, now that the answer (along with the interpretation of the ambiguous question) has completely changed, I'll pick nits at the lack of encoding instead. ;-)
bobince
I don't think it is a good idea to reinvent the wheel, especially when your wheel is square. You should use HtmlEncode instead.
Kramii
A: 

You may want to use SgmlReader.

http://code.msdn.microsoft.com/SgmlReader

Leonardo Herrera
+2  A: 

If you have data that has HTML tags and you want to display it so that a person can SEE the tags, use HttpServerUtility.HtmlEncode.

If you have data that has HTML tags in it and you want the user to see the tags rendered, then display the text as is. If the text represents an entire web page, use an IFRAME for it.

If you have data that has HTML tags and you want to strip out the tags and just display the unformatted text, use a regular expression.

Corey Trager
In PHP there is a function called strip_tags(); maybe you have something similar.
tharkun
+4  A: 

If you are talking about tag stripping, it is relatively straightforward if you don't have to worry about things like <script> tags. If all you need to do is display the text without the tags, you can accomplish that with a regular expression:

<[^>]*>
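A minimal sketch of applying that pattern in C#, assuming only System.Text.RegularExpressions; the class and method names (TagStripper, StripTags) are hypothetical:

```csharp
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Matches '<', then any run of characters other than '>', then '>'.
    private static readonly Regex TagPattern = new Regex("<[^>]*>", RegexOptions.Compiled);

    // Removes every match of the pattern; text between tags is kept as-is.
    public static string StripTags(string html)
    {
        return TagPattern.Replace(html, string.Empty);
    }
}
```

TagStripper.StripTags("<b>hello</b> world") gives "hello world". The text inside a <script> element is not a tag, so it survives: "<script>alert(1)</script>x" becomes "alert(1)x".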

If you do have to worry about <script> tags and the like, then you'll need something a bit more powerful than regular expressions because you need to track state, something more like a Context Free Grammar (CFG). Although you might be able to accomplish it with 'Left To Right' or non-greedy matching.

If you can use regular expressions, there are many web pages out there with good info.

If you need the more complex behaviour of a CFG, I would suggest using a third-party tool; unfortunately, I don't know of a good one to recommend.

vfilby
You also have to worry about > in attribute values, comments, PIs/CDATA in XML and various common malformednesses in legacy HTML. In general [X][HT]ML is not amenable to parsing with regexps.
bobince
You can accommodate the > in attribute values by making attributes a part of the regular expression. It is only the complexity of nested tags that limits the usefulness of parsing with regular expressions.
vfilby
don't you mean <[^>]*> which matches things like <html>, and not <[^>]>* which matches things like <h>>>> ?
Greg
You are correct sir, typo fixed.
vfilby
A: 

Depends on what you mean by "html." The most complex case would be complete web pages. That's also the easiest to handle, since you can use a text-mode web browser. See the Wikipedia article listing web browsers, including text mode browsers. Lynx is probably the best known, but one of the others may be better for your needs.

mpez0
+7  A: 

HttpUtility.HtmlEncode() is meant to handle encoding HTML tags as strings. It takes care of all the heavy lifting for you. From the MSDN documentation:

If characters such as blanks and punctuation are passed in an HTTP stream, they might be misinterpreted at the receiving end. HTML encoding converts characters that are not allowed in HTML into character-entity equivalents; HTML decoding reverses the encoding. For example, when embedded in a block of text, the characters < and >, are encoded as &lt; and &gt; for HTTP transmission.

HttpUtility.HtmlEncode() method, detailed here:

public static void HtmlEncode(
    string s,
    TextWriter output
)

Usage:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();
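If you don't need a TextWriter, a shorter sketch is possible with the single-argument overload, which returns the encoded string directly. This assumes a reference to System.Web; the wrapper class and method names (EncodeExample, Encode) are only for illustration:

```csharp
using System.Web;

public static class EncodeExample
{
    // Hypothetical wrapper around the string-returning overload of
    // HttpUtility.HtmlEncode; no StringWriter is needed.
    public static string Encode(string text)
    {
        return HttpUtility.HtmlEncode(text);
    }
}
```

EncodeExample.Encode("This is a <Test String>.") should give "This is a &lt;Test String&gt;.".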

I hope that helps.

George Stocker
A really good answer George thanks, it also highlighted how poorly I asked the question first time around. Sorry.
Stuart Helwig
+9  A: 

The free and open source HtmlAgilityPack has a method (in one of the samples that comes with it):

var plainText = ConvertToPlainText(html);

Feed it an HTML string like

<b>hello world!</b><br /><i>it is me!</i>

And you'll get a plain text result like:

hello world!
it is me!
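If you don't have the sample at hand, a rough approximation can be sketched with the core HtmlAgilityPack API. This is not the sample's ConvertToPlainText: it assumes the HtmlAgilityPack package is referenced, the names PlainTextExtractor and ToPlainText are made up, and it does not turn <br /> into line breaks:

```csharp
using HtmlAgilityPack;

public static class PlainTextExtractor
{
    // Hypothetical helper: parses the markup and returns the concatenated
    // text nodes with HTML entities decoded. Block structure and <br />
    // line breaks are not preserved.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```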
Judah Himango
I have used HtmlAgilityPack before, but I can't see any reference to ConvertToPlainText. Are you able to tell me where I can find it?
horatio
Horatio, it is included in one of the samples that comes with HtmlAgilityPack: http://htmlagilitypack.codeplex.com/sourcecontrol/changeset/view/62772?projectName=htmlagilitypack#52179
Judah Himango
Thanks for that
horatio
Thank you for this! :)
xraminx