+7  A: 

Take a look at this: "Strip HTML tags from a string using regular expressions".

RioTera
A better idea would be to use an HTML parser.
mkoryak
Why, if a simple regex does the job?
RioTera
@mkoryak: Could you please explain why it would be better?
Mr. Smith
This will strip tags but leave entities HTML-encoded, so it's not really a complete answer.
richardtallent
To add to what richardtallent said: malformed HTML can break a regex and cause it to strip things it shouldn't. A full HTML parser is designed to accommodate malformed HTML so you don't lose data, or gain "extra" data.
Dan Herbert
I think that if you have malformed HTML, a good solution would be to fix it before storing it (for example with HTML Tidy). Malformed HTML can break your layout, depending on where you display it.
RioTera
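To make the two objections raised in the comments above concrete, here is a minimal sketch (not anyone's posted code). The class and method names are made up for illustration, and `WebUtility.HtmlDecode` from `System.Net` is assumed to be available (.NET 4.0 or later).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class RegexStripDemo
{
    // The simple tag pattern discussed in this thread.
    private static readonly Regex TagPattern = new Regex("<[^>]*>");

    // Removes anything that looks like a tag; entities are left encoded.
    public static string StripTags(string html)
    {
        return TagPattern.Replace(html ?? string.Empty, string.Empty);
    }

    // Strips tags first, then decodes entities such as &amp; and &lt;.
    public static string StripTagsAndDecode(string html)
    {
        return WebUtility.HtmlDecode(StripTags(html));
    }
}
```

With this sketch, `StripTags("<p>Fish &amp; chips</p>")` leaves the entity encoded, while `StripTagsAndDecode` decodes it to `Fish & chips`. A `>` inside an attribute value, as in `<a title="a > b">link</a>`, ends the regex match early and leaves part of the attribute in the output, which is the kind of case a real HTML parser handles.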
A: 

You can use something like this:

// Requires: using System.Text.RegularExpressions;
string strwithouthtmltag;
strwithouthtmltag = Regex.Replace(strWithHTMLTags, "<[^>]*>", string.Empty);

Neil
A: 

If you are just storing text for indexing, then you probably want to do a bit more than just remove the HTML, such as ignoring stop-words and removing words shorter than (say) 3 characters. However, a simple tag and entity stripper I once wrote goes something like this:

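 // Requires: using System.Text.RegularExpressions;
 public static string StripTags(string value)
 {
     if (value == null)
         return string.Empty;

     // Replace character entities such as &amp; or &#160; with a space.
     // (Restricting the body to word characters avoids eating ordinary text
     // that merely contains '&' and ';'.)
     value = Regex.Replace(value, @"&#?\w{1,8};", " ");

     // Remove tags. [^>] also matches newlines, so multi-line tags are covered.
     return Regex.Replace(value, @"<[^>]*>", string.Empty);
 }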

It's old and I'm sure it can be optimised (perhaps using a compiled reg-ex?). But it does work and may help...

Dan Diplo
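The extra indexing step mentioned in the answer above (ignoring stop-words and dropping very short words) could look roughly like this. It is only a sketch: the class name, the `Extract` method and the tiny stop-word list are all hypothetical, and the input is assumed to be text that has already had its HTML removed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class IndexTerms
{
    // Deliberately tiny, made-up stop-word list; a real index would use a fuller one.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "the", "and", "for", "with", "that", "this"
        };

    // Returns lower-cased words of at least minLength characters that are not stop-words.
    public static List<string> Extract(string plainText, int minLength = 3)
    {
        if (string.IsNullOrWhiteSpace(plainText))
            return new List<string>();

        return Regex.Matches(plainText, @"\w+")
            .Cast<Match>()
            .Select(m => m.Value.ToLowerInvariant())
            .Where(w => w.Length >= minLength && !StopWords.Contains(w))
            .ToList();
    }
}
```

For example, `IndexTerms.Extract("The cat sat on a mat with the dog")` keeps `cat`, `sat`, `mat` and `dog`, and drops the stop-words and the one- and two-letter words.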
A: 

You could:

  • Use a plain old TEXTAREA (styled for height/width/font/etc.) rather than TinyMCE.
  • Use TinyMCE's built-in configuration options for stripping unwanted HTML.
  • Use HtmlDecode(Regex.Replace(mystring, "<[^>]+>", "")) on the server.
richardtallent
+1  A: 

Here's a link to Jeff Atwood's RefactorMe code for his Sanitize HTML method.

TreeUK
+6  A: 

I downloaded the HtmlAgilityPack and created this function:

string StripHtml(string html)
{
    // create whitespace between html elements, so that words do not run together
    html = html.Replace(">","> ");

    // parse html
    var doc = new HtmlAgilityPack.HtmlDocument(); 
    doc.LoadHtml(html);

    // extract the text content and decode HTML entities
    string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText); 

    // replace all whitespace with a single space and remove leading and trailing whitespace
    return Regex.Replace(text, @"\s+", " ").Trim();
}
Ronnie Overby
See my answer!
Josh Stodola
Take a look at richardtallent's comment on your answer.
Ronnie Overby
Now take a look.
Josh Stodola
I saw it. I think I will just stick with the 5 lines of code I have written.
Ronnie Overby
Hilarious. Lines of code is the most important thing in programming, thanks for the reminder. With that kind of idiotic rationale, perhaps you should consider how many lines of code are behind the HtmlAgilityPack.
Josh Stodola
My wife called me an idiot this morning. I must really be one. Let the votes determine which is the best way to go, Josh.
Ronnie Overby
@Josh - You are the first person on all of StackOverflow that I have ever seen argue with an OP so that he will accept or even acknowledge your answer. Also, I forgot to say that while, yes, the HtmlAgilityPack MAY execute some more instructions to do what I need than your code sample, it is also a proven library that has been endorsed by many developers.
Ronnie Overby
I don't care if you accept my answer or not. Honestly. This "reputation" means nothing to me (nor should it). What I did not appreciate was you coming to my question about TDD and answering with "You are stupid" for absolutely no reason (except for me trying to help you, *so sorry*). And then you deleted it immediately like a silly chimp. I am sorry that high school sucked that bad for you, but it's time to get over it.
Josh Stodola
I deleted that comment because I realized that it was wrong (to behave that way). Thanks for your help. What do you think of this for my new avatar? http://tinyurl.com/sillychimp You were my inspiration! I loved high school. Go ahead and get your last word in. I'm done with this.
Ronnie Overby
A: 

Since you may have malformed HTML in the system, BeautifulSoup or something similar could be used.

It is written in Python; I am not sure how it could be interfaced with. Perhaps by using the .NET language IronPython?

Peter Mortensen
A: 
seagulf