ansaurus

Question

Answer 1

+11 A:

If you just need to strip all HTML tags from a string, this works reliably with regex as well. Replace:

<[^>]*>

with the empty string, globally. Don't forget to normalize the string afterwards, replacing:

[\s\r\n]+

with a single space, and trim the result. You may also want to decode HTML character entities back to the actual characters.

Tomalak 2009-04-24 13:03:10
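
For reference, the three steps in the answer above (strip tags, normalize whitespace, decode entities) could be sketched in C# roughly as follows. The class and method names are made up for illustration, and the sketch assumes `System.Net.WebUtility.HtmlDecode` is available (.NET 4.0 or later); on older ASP.NET, `HttpUtility.HtmlDecode` from System.Web plays the same role. Entities are decoded before the whitespace pass so that a decoded non-breaking space gets collapsed too.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlStripper
{
    // Hypothetical helper name, not part of the original answer.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        // 1. Remove everything that looks like a tag: '<', any run of non-'>' characters, '>'.
        string text = Regex.Replace(html, "<[^>]*>", string.Empty);

        // 2. Turn entities such as &amp; back into the characters they stand for.
        text = WebUtility.HtmlDecode(text);

        // 3. Collapse every run of whitespace (including line breaks) into one space, then trim.
        text = Regex.Replace(text, @"\s+", " ");
        return text.Trim();
    }
}
```

Called as `HtmlStripper.StripTags(htmlText)`. Note that tags are replaced with the empty string, so text from adjacent block elements ends up joined together, and a literal `>` inside an attribute value will confuse the pattern. That is usually acceptable for simple input, but it is a pattern match and not a parser.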

Answer 2

+2 A:

Regex.Replace(htmlText, "<.*?>", string.Empty);

2009-04-24 13:06:11

Answer 3

A:

Doing this without utilizing regex in some way is like asking someone to fill a tire without using air...

alex 2009-04-24 13:07:02

That's not at all true. Regex is just a leaky abstraction of a state machine.

Rex M 2009-04-24 13:08:44

That's really not true. Of course, at least in my situation, it's not worth writing a full parser to tackle this problem, but I was hoping there was already some library built into ASP.NET.

daniel 2009-04-24 13:23:02

@Rex - I get your point, but I wasn't expecting Daniel to whip up a more efficient deterministic state machine.

alex 2009-04-24 13:32:13
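
To make the state-machine point in the comments concrete, here is a minimal sketch of stripping tags with no regex at all: a two-state scanner that is either inside a tag or outside one. The class and method names are invented for the example, and it makes the same simplifying assumption as the regex answers, namely that every `<` opens a tag and the next `>` closes it.

```csharp
using System.Text;

public static class TagScanner
{
    // Hypothetical helper name, for illustration only.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var sb = new StringBuilder(html.Length);
        bool insideTag = false; // the only state: are we between '<' and '>'?

        foreach (char c in html)
        {
            if (c == '<')
            {
                insideTag = true;      // entering a tag, stop copying
            }
            else if (c == '>' && insideTag)
            {
                insideTag = false;     // leaving the tag, resume copying
            }
            else if (!insideTag)
            {
                sb.Append(c);          // ordinary text, including a stray '>'
            }
        }
        return sb.ToString();
    }
}
```

One difference from the `<[^>]*>` pattern: an unclosed `<` makes this scanner drop the rest of the string, whereas the regex would leave it in place. It does not decode entities or normalize whitespace either, so those steps would still have to follow.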

Answer 4

+2 A:

Andrei Rinea 2009-04-24 17:54:27

Answer 5

+1 A:

string result = Regex.Replace(anytext, @"<(.|\n)*?>", string.Empty);

2009-05-14 20:26:38

Answer 6

+4 A:

Go download HtmlAgilityPack, now! ;)

This allows you to load and parse HTML. Then you can navigate the DOM and extract the inner content of its nodes. Seriously, it will take you about 10 lines of code at most. It is one of the greatest free .NET libraries out there.

Here is a sample:

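            // Read the whole stream as UTF-8, honoring a byte-order mark if one is present.
            string htmlContents = new System.IO.StreamReader(resultsStream, System.Text.Encoding.UTF8, true).ReadToEnd();

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(htmlContents);
            if (doc == null) return null;

            // Use InnerText to drop the markup; InnerHtml would keep the tags.
            string output = "";
            foreach (var node in doc.DocumentNode.ChildNodes)
            {
                output += node.InnerText;
            }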

Serapth 2009-05-14 20:33:41

Answer 7

+5 A:

I've posted this on the ASP.NET forums, and it still seems to be one of the easiest solutions out there. I won't guarantee it's the fastest or most efficient, but it's pretty reliable. In .NET you can use the HTML Web Control objects themselves. All you really need to do is insert your string into a temporary HTML object such as a DIV, then use the built-in 'InnerText' to grab all text that is not contained within tags. See below for a simple C# example:


System.Web.UI.HtmlControls.HtmlGenericControl htmlDiv = new System.Web.UI.HtmlControls.HtmlGenericControl("div");
htmlDiv.InnerHtml = htmlString;
String plainText = htmlDiv.InnerText;

Michael Tipton 2009-11-05 17:16:51


Asp.NET - Strip HTML Tags
