tags:

views:

518

answers:

3

Possible Duplicate:
How to clean HTML tags using C#

What is the best way to strip HTML tags in C#?

+1  A: 

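To guarantee that no HTML tags get through as markup, encode the input with HttpServerUtility.HtmlEncode(string). This escapes the tags so they display as literal text; it does not remove them.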
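A minimal sketch of the encoding approach. It assumes System.Net.WebUtility.HtmlEncode, a static method that needs no HttpServerUtility instance, as a stand-in for the call named above. The class and method names are hypothetical.

```csharp
using System.Net;

public static class HtmlEncodingSketch
{
    // Escapes markup characters so a browser shows them as literal text.
    // Nothing is removed: "<b>" becomes "&lt;b&gt;".
    public static string EncodeForDisplay(string userInput)
    {
        // Treat null as empty input (an assumption of this sketch).
        return WebUtility.HtmlEncode(userInput ?? string.Empty);
    }
}
```

For example, HtmlEncodingSketch.EncodeForDisplay("<b>hi</b>") returns "&lt;b&gt;hi&lt;/b&gt;".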

If you want some to get through, you can use this "Whitelist" approach.

George Stocker
Filip Ekberg
It all depends on the result he wants. If he wants to make sure that no HTML tags are ever executed (which would open him up to XSS), then the first way is the 'best' way. If he just wants plain text to come through, a variation of the second way is 'best'.
George Stocker
He might want to remove tags to display the content as plain text in an RSS feed or something similar. PHP has a built-in function, http://php.net/strip_tags, which by the sound of it is what he wants. The whitelist solves that too, and you could also use that HTML Pack, or whatever it is called.
Filip Ekberg
+1  A: 
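  // Requires: using System.Text.RegularExpressions;
  // Removes anything that looks like a tag. Entities such as &amp; are left as-is.
  public static string StripHTML(string htmlString)
  {
     // Lazily match from '<' to the next '>', including across newlines.
     string pattern = @"<(.|\n)*?>";

     return Regex.Replace(htmlString, pattern, string.Empty);
  }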
aloneguid
Nice googling..
Filip Ekberg
my pleasure, at your service, mam
aloneguid
+3  A: 

Take your HTML string or document and parse it with HTML Agility Pack. This will give you an HtmlDocument object that is very similar to an XmlDocument.

You can then use its methods, such as SelectNodes, to access those portions of the document that you are interested in.
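A rough sketch of that flow, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names are hypothetical, and the decision to discard script and style elements is an assumption of this example, not something the library does on its own.

```csharp
using HtmlAgilityPack;

public static class AgilityPackSketch
{
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        // Drop script and style elements so their source does not leak into the text.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var hidden = doc.DocumentNode.SelectNodes("//script|//style");
        if (hidden != null)
        {
            foreach (var node in hidden)
            {
                node.Remove();
            }
        }

        // InnerText concatenates the remaining text nodes;
        // DeEntitize turns entities such as "&amp;" back into "&".
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Because the parser builds a tree first, tags are dropped by walking nodes, not by matching angle brackets in the raw string.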

If you choose to use another approach, be aware that parsing HTML (a non-Regular language) with Regular Expressions is widely regarded as a bad idea.
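To make the concern concrete, here is a small sketch that reuses the regex pattern posted earlier in this thread. The class name is hypothetical. It behaves well on simple markup and breaks as soon as a '>' appears inside an attribute value.

```csharp
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Same pattern as the regex answer in this thread:
    // lazily match from '<' to the next '>', newlines included.
    public static string StripHtml(string html)
    {
        return Regex.Replace(html, @"<(.|\n)*?>", string.Empty);
    }
}
```

RegexStripDemo.StripHtml("<p>Hello <b>world</b></p>") returns "Hello world", which is the intended result. RegexStripDemo.StripHtml("<a title=\"a > b\">link</a>") returns " b\">link", because the match for the opening tag stops at the first '>' inside the attribute and the rest of the tag leaks into the output.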

And regardless of the approach, if you are keeping some markup, use a whitelist approach. This means removing everything that is not explicitly wanted.
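One way a whitelist pass might look with HTML Agility Pack, again assuming the HtmlAgilityPack package is referenced. The allowed tag list, the class name, and the policy (strip every attribute, delete disallowed elements together with their content) are all assumptions for illustration, not a vetted sanitizer.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class WhitelistSketch
{
    // Hypothetical policy: only these elements survive, and none of their attributes.
    private static readonly HashSet<string> AllowedTags =
        new HashSet<string> { "b", "i", "em", "strong", "div" };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        // Snapshot every non-text node first, because the loop mutates the tree.
        var nodes = doc.DocumentNode.Descendants()
            .Where(n => n.NodeType != HtmlNodeType.Text)
            .ToList();

        foreach (var node in nodes)
        {
            bool allowed = node.NodeType == HtmlNodeType.Element
                           && AllowedTags.Contains(node.Name);
            if (allowed)
            {
                node.Attributes.RemoveAll(); // drops onclick, style, href, and so on
            }
            else
            {
                node.Remove();               // removes the node and everything inside it
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

The point of the design is the default: anything not named in AllowedTags is removed, so an unfamiliar or newly introduced element cannot slip through the way it could with a blacklist.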

Lachlan Roche
HTML Agility Pack saved me one day. +1
kenny