Using ASP.NET, how do I strip the HTML tags from a given string reliably (i.e. not using regex)? Something like PHP's strip_tags.

Example: for the string: "<ul><li>Hello</li></ul>" return "Hello".

I didn't want to write some kind of parser, so I looked for one in the standard library, but couldn't find anything.

Thanks!

+11  A: 

If you just need to strip all HTML tags from a string, regex works reliably too. Replace:

<[^>]*>

with the empty string, globally. Don't forget to normalize the string afterwards, replacing:

[\s\r\n]+

with a single space, and trim the result. Optionally, decode HTML character entities back into the actual characters.
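The three steps above (strip tags, normalize whitespace, decode entities) can be sketched in C# roughly as follows. This is a minimal sketch, not a definitive implementation: the `HtmlText` class and `StripTags` method names are made up for illustration, and it assumes `System.Net.WebUtility.HtmlDecode` is available (.NET 4.0 or later; `HttpUtility.HtmlDecode` from System.Web would be the older equivalent). The pattern can misfire on a `>` inside an attribute value or comment, and the contents of script and style elements are kept as text.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are illustrative only.
public static class HtmlText
{
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // 1. Remove everything that looks like a tag.
        string text = Regex.Replace(html, @"<[^>]*>", string.Empty);

        // 2. Collapse runs of whitespace (including line breaks) into one space, then trim.
        text = Regex.Replace(text, @"\s+", " ").Trim();

        // 3. Decode entities last, so an encoded "&lt;b&gt;" survives as literal text
        //    instead of being stripped as a tag.
        return WebUtility.HtmlDecode(text);
    }
}
```

With the string from the question, `HtmlText.StripTags("<ul><li>Hello</li></ul>")` returns `"Hello"`.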

Tomalak
+2  A: 
Regex.Replace(htmlText, "<.*?>", string.Empty);
A: 

Doing this without utilizing regex in some way is like asking someone to fill a tire without using air...

alex
That's not at all true. Regex is just a leaky abstraction of a state machine.
Rex M
That's really not true. Of course, at least in my situation, it's not worth writing a full parser to tackle this problem, but I was hoping there was already some library built into ASP.NET.
daniel
@Rex - I get your point, but I wasn't expecting Daniel to whip up a more efficient deterministic state machine.
alex
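For what it's worth, the state machine mentioned in these comments does not need to be elaborate. Below is a minimal sketch of a two-state (inside a tag / outside a tag) stripper that uses no regex at all. The `TagStripper` class and `Strip` method are hypothetical names, not something from ASP.NET. It is deliberately naive: a stray `<` in plain text (as in "1 < 2") swallows everything up to the next `>`, and entities are not decoded.

```csharp
using System.Text;

// Hypothetical helper; a naive two-state scanner, not a real HTML parser.
public static class TagStripper
{
    public static string Strip(string html)
    {
        if (html == null) return string.Empty;

        var sb = new StringBuilder(html.Length);
        bool insideTag = false;

        foreach (char c in html)
        {
            if (c == '<')
            {
                insideTag = true;       // entering a tag: stop copying
            }
            else if (c == '>' && insideTag)
            {
                insideTag = false;      // leaving a tag: resume copying
            }
            else if (!insideTag)
            {
                sb.Append(c);           // plain text, including a stray '>'
            }
        }

        return sb.ToString();
    }
}
```

`TagStripper.Strip("<ul><li>Hello</li></ul>")` returns `"Hello"`. Anything beyond this (comments, script blocks, quoted attribute values containing `>`) is where a real parser starts to pay off.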
+2  A: 
Andrei Rinea
+1  A: 

string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);

+4  A: 

Go download HtmlAgilityPack, now! ;)

This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner text of any node. Seriously, it will take you about 10 lines of code at most. It is one of the greatest free .NET libraries out there.

Here is a sample:

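            // resultsStream is whatever stream holds the HTML (e.g. a web response).
            string htmlContents = new System.IO.StreamReader(resultsStream, Encoding.UTF8, true).ReadToEnd();

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(htmlContents);

            // InnerText concatenates the text nodes and leaves the tags out;
            // InnerHtml would put the markup straight back into the output.
            // Entities such as &amp; stay encoded; HtmlEntity.DeEntitize can decode them.
            string output = doc.DocumentNode.InnerText;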
Serapth
+5  A: 

I've posted this on the asp.net forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable. In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:


System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;
Michael Tipton