ansaurus

Question

How can I strip HTML from text in .NET?

Answer 1

+7 A:

Take a look at this: Strip HTML tags from a string using regular expressions

RioTera 2009-08-28 19:59:50

A better idea would be to use an HTML parser.

mkoryak 2009-08-28 20:00:28

Why, if a simple regex does the job?

RioTera 2009-08-28 20:02:30

@mkoryak: Could you please explain why it would be better?

Mr. Smith 2009-08-28 20:03:20

This will strip tags but leave entities HTML-encoded, so it's not really a complete answer.

richardtallent 2009-08-28 20:28:21

To add to what richardtallent said: malformed HTML can break a regex and cause it to strip things it shouldn't. A full HTML parser is designed to accommodate malformed HTML so you don't lose data, or gain "extra" data.

Dan Herbert 2009-08-28 20:29:19
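
A minimal sketch of the failure mode described above, using the `<[^>]*>` pattern that appears elsewhere in this thread. The class and method names (`NaiveStripDemo`, `StripTagsNaive`) are made up for illustration and do not come from any answer here.

```csharp
using System.Text.RegularExpressions;

public static class NaiveStripDemo
{
    // Removes everything from a '<' up to the next '>'.
    // The regex has no notion of attributes, quoting, or plain-text brackets.
    public static string StripTagsNaive(string input)
    {
        return Regex.Replace(input, "<[^>]*>", string.Empty);
    }
}
```

On well-formed input such as `<b>bold</b> text` this returns `bold text`, as intended. On `if a < b and c > d then` it deletes the whole span between the two brackets, so the words "b and c" are lost. On `<a title="x > y">link</a>` it treats the `>` inside the attribute as the end of the tag and leaves the fragment ` y">link` behind. A parser avoids both problems because it tokenizes the markup instead of pattern-matching on brackets.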

I think that if you have malformed HTML, a good solution would be to fix it before storing it (HTML Tidy). Malformed HTML can break your layout, depending on where you display it.

RioTera 2009-08-29 11:51:31

Answer 2

A:

You can use something like this:

string strwithouthtmltag;
strwithouthtmltag = Regex.Replace(strWithHTMLTags, "<[^>]*>", string.Empty);

Neil 2009-08-28 20:07:14

Answer 3

A:

If you are just storing text for indexing then you probably want to do a bit more than just remove the HTML, such as ignoring stop-words and removing words shorter than (say) 3 characters. However, a simple tag and entity stripper I once wrote goes something like this:

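public static string StripTags(string value)
{
    if (value == null)
        return string.Empty;

    // Replace character entities such as &nbsp; or &#160; with a space.
    // Restricting the body to word characters keeps text like "A & B; C" intact.
    string pattern = @"&#?\w{1,8};";
    value = Regex.Replace(value, pattern, " ");

    // Remove tags, including ones that span several lines.
    pattern = @"<(.|\n)*?>";
    return Regex.Replace(value, pattern, string.Empty);
}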

It's old and I'm sure it can be optimised (perhaps using a compiled reg-ex?). But it does work and may help...

Dan Diplo 2009-08-28 20:19:33
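
For the indexing step mentioned in this answer (ignoring stop-words and dropping words shorter than 3 characters), a rough sketch could look like the following. Everything here is an assumption for illustration: the `IndexTextSketch` class, the `FilterWords` method, and the tiny stop-word list are not from the answer, and the input is expected to be text that has already had its HTML removed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IndexTextSketch
{
    // Tiny illustrative stop-word list; a real index would use a fuller one.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "the", "and", "for", "with", "that", "this"
        };

    // Splits already-stripped text on whitespace, then drops stop-words
    // and any word shorter than minLength characters.
    public static List<string> FilterWords(string text, int minLength = 3)
    {
        if (string.IsNullOrWhiteSpace(text))
            return new List<string>();

        return text
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => w.Length >= minLength && !StopWords.Contains(w))
            .ToList();
    }
}
```

Calling `IndexTextSketch.FilterWords("The cat is on a big mat")` keeps only `cat`, `big` and `mat`. Punctuation is not handled, so a word like `mat.` would be kept with its trailing period; trimming punctuation would be a separate step.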

Answer 4

A:

You could:

Use a plain old TEXTAREA (styled for height/width/font/etc.) rather than TinyMCE.
Use TinyMCE's built-in configuration options for stripping unwanted HTML.
Use HtmlDecode(Regex.Replace(mystring, "<[^>]+>", "")) on the server.

richardtallent 2009-08-28 20:20:52
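
The server-side option can be sketched as below. The answer leaves `HtmlDecode` unqualified; this sketch assumes `System.Net.WebUtility.HtmlDecode` (available from .NET 4.0), and `HttpUtility.HtmlDecode` from System.Web would be used the same way. The class and method names (`HtmlToTextSketch`, `StripAndDecode`) are hypothetical.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlToTextSketch
{
    public static string StripAndDecode(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // 1. Remove anything that looks like a tag.
        string withoutTags = Regex.Replace(html, "<[^>]+>", string.Empty);

        // 2. Turn entities such as &amp; or &#233; back into characters.
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

For example, `StripAndDecode("<p>Fish &amp; chips</p>")` gives `Fish & chips`. Decoding after stripping is deliberate: text that was entered as `&lt;b&gt;` comes out as the literal characters `<b>` instead of being mistaken for a tag and removed. The regex step still has the malformed-HTML limitations discussed in the comments on Answer 1.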

Answer 5

+1 A:

Here's a link to Jeff Atwood's RefactorMe code for his Sanitize HTML method.

TreeUK 2009-08-28 20:31:31

Answer 6

+6 A:

I downloaded the HtmlAgilityPack and created this function:

string StripHtml(string html)
{
    // create whitespace between html elements, so that words do not run together
    html = html.Replace(">","> ");

    // parse html
    var doc = new HtmlAgilityPack.HtmlDocument(); 
    doc.LoadHtml(html);

    // strip html decoded text from html
    string text = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText); 

    // replace all whitespace with a single space and remove leading and trailing whitespace
    return Regex.Replace(text, @"\s+", " ").Trim();
}

Ronnie Overby 2009-08-28 21:07:58

See my answer!

Josh Stodola 2009-08-28 21:14:02

Take a look at richardtallent's comment on your answer.

Ronnie Overby 2009-08-28 21:19:02

Now take a look.

Josh Stodola 2009-08-31 13:55:35

I saw it. I think I will just stick with the 5 lines of code I have written.

Ronnie Overby 2009-08-31 16:36:02

Hilarious. Lines of code is the most important thing in programming, thanks for the reminder. With that kind of idiotic rationale, perhaps you should consider how many lines of code are behind the HtmlAgilityPack.

Josh Stodola 2009-08-31 18:30:21

My wife called me an idiot this morning. I must really be one. Let the votes determine which is the best way to go, Josh.

Ronnie Overby 2009-08-31 19:14:34

@Josh - You are the first person on all of StackOverflow that I have ever seen argue with an OP so that he will accept or even acknowledge your answer. Also, I forgot to say that while, yes, the HtmlAgilityPack MAY execute some more instructions to do what I need than your code sample, it is also a proven library that has been endorsed by many developers.

Ronnie Overby 2009-08-31 20:40:21

I don't care if you accept my answer or not. Honestly. This "reputation" means nothing to me (nor should it). What I did not appreciate was you coming to my question about TDD and answering with "You are stupid" for absolutely no reason (except for me trying to help you, *so sorry*). And then you deleted it immediately like a silly chimp. I am sorry that high school sucked that bad for you, but it's time to get over it.

Josh Stodola 2009-08-31 21:27:53

I deleted that comment because I realized that it was wrong (to behave that way). Thanks for your help. What do you think of this for my new avatar? http://tinyurl.com/sillychimp You were my inspiration! I loved high school. Go ahead and get your last word in. I'm done with this.

Ronnie Overby 2009-08-31 22:10:26

Answer 7

A:

Since you may have malformed HTML in the system, BeautifulSoup or something similar could be used.

It is written in Python, and I am not sure how it could be interfaced from .NET. Perhaps through IronPython, the .NET implementation of Python?

Peter Mortensen 2009-08-28 21:23:02

Answer 8

A:

seagulf 2010-05-10 14:37:17
