ansaurus

Question

Using C# regular expressions to remove HTML tags

Answer 1

+6 A:

Regex regex = new Regex(@"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>", RegexOptions.Singleline);

Source

Ryan Emerle 2009-04-25 00:31:07
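
A minimal sketch of how the pattern above might be applied, for readers who want to try it. The class name TagStripper and the method name StripTags are made up for illustration; only the pattern and the Singleline option come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Pattern copied unchanged from the answer above.
    private static readonly Regex TagRegex = new Regex(
        @"</?\w+((\s+\w+(\s*=\s*(?:"".*?""|'.*?'|[^'"">\s]+))?)+\s*|\s*)/?>",
        RegexOptions.Singleline);

    // Hypothetical helper: replaces every matched tag with the empty string.
    public static string StripTags(string html)
    {
        return TagRegex.Replace(html, string.Empty);
    }

    public static void Main()
    {
        Console.WriteLine(StripTags("<p class=\"intro\">Hello, <b>world</b>!</p>"));
    }
}
```

Because attribute names are matched with \w+, a tag whose attribute name contains a hyphen (for example data-id) is not matched by this pattern and is left in the output.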

Answer 2

A:

You can use RegEx.Replace() to accomplish this

RegEx objRegExp = new RegEx("<(.|\n)+?>");       
String strOutput = objRegExp.Replace(strHTML, String.Empty);

Jeremy 2009-04-25 00:31:27

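What about carriage returns? In regex flavors where the dot excludes "\r", (.|\n) won't match the DOS/Windows-style line separator, "\r\n" (the .NET dot excludes only "\n", so it happens to work here). But you don't need that hack anyway; just use the Singleline option like Ryan did. Oh, and it's "Regex", not "RegEx".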

Alan Moore 2009-04-25 01:15:27

Answer 3

+5 A:

As often stated before, you should not use regular expressions to process XML or HTML documents. They do not work well with HTML and XML because there is no general way for a regular expression to describe nested structures.

You could use the following.

String result = Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);

This will work for most cases, but there will be cases (for example, CDATA containing angle brackets) where it will not work as expected.

Daniel Brückner 2009-04-25 00:31:48

This is a naive implementation. That is, <div id="x<4>"> is, unfortunately, valid HTML. It handles most sane cases, though.

Ryan Emerle 2009-04-25 00:38:01
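
To make the limitation discussed here concrete, the following sketch runs the simple pattern from this answer on an ordinary fragment and on the <div id="x<4>"> case. The class and method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NaiveStripDemo
{
    // The one-line approach from the answer above.
    public static string Strip(string htmlDocument)
    {
        return Regex.Replace(htmlDocument, @"<[^>]*>", String.Empty);
    }

    public static void Main()
    {
        // Ordinary markup is handled as intended.
        Console.WriteLine(Strip("<p>Hello, <b>world</b>!</p>"));   // Hello, world!

        // The match stops at the first '>' inside the attribute value,
        // so the rest of the opening tag leaks into the output.
        Console.WriteLine(Strip("<div id=\"x<4>\">text</div>"));   // ">text
    }
}
```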

As stated, I am aware that this expression will fail in some cases. I am not even sure if the general case can be handled by any regular expression without errors.

Daniel Brückner 2009-04-25 00:49:39

No, this will fail in all cases! It's greedy.

Cipher 2009-04-25 01:04:40

@Cipher, why do you think greediness is a problem? Assuming the match starts at the beginning of a valid HTML tag, it will never extend beyond the end of that tag. That's what the [^>] is for.

Alan Moore 2009-04-25 01:37:55
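
The point about greediness can be checked with a small comparison. This sketch contrasts the negated character class from the answer with a greedy dot, which is the pattern where greediness really does cause trouble; the greedy-dot pattern is not from the thread and is included only for contrast.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedinessDemo
{
    // [^>]* is greedy, but it cannot cross a '>', so each match ends at the first '>'.
    public static string StripWithNegatedClass(string html)
    {
        return Regex.Replace(html, @"<[^>]*>", string.Empty);
    }

    // .* is greedy and can cross '>', so one match runs to the last '>' on the line.
    public static string StripWithGreedyDot(string html)
    {
        return Regex.Replace(html, @"<.*>", string.Empty);
    }

    public static void Main()
    {
        string html = "<b>bold</b> and <i>italic</i>";
        Console.WriteLine(StripWithNegatedClass(html)); // bold and italic
        Console.WriteLine(StripWithGreedyDot(html));    // prints an empty line
    }
}
```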

Answer 4

+16 A:

The correct answer is don't do that, use the HTML Agility Pack.

JasonTrue 2009-04-25 00:51:44
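
For reference, a minimal sketch of what the parser-based approach could look like. It assumes the HtmlAgilityPack NuGet package is installed; the class and method names are invented for illustration, and the answer itself does not show any code.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class AgilityPackExample
{
    // Hypothetical helper: parse the markup, then read the text of the whole tree.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText leaves entities such as &amp; encoded, so decode them afterwards.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }

    public static void Main()
    {
        Console.WriteLine(ExtractText("<p>Fish &amp; <b>chips</b></p>"));
    }
}
```

Depending on the input, elements such as SCRIPT or STYLE may need to be removed from the tree before reading InnerText.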

HTML Agility Pack is not the answer to everything related to working with HTML (e.g. what if you only want to work with fragments of the HTML code?!).

PropellerHead 2009-10-23 07:23:16

It works pretty well with fragments of HTML, and it's the best option for the scenario described by the original poster. A Regex, on the other hand, only works with idealized HTML and will break on perfectly valid HTML, because the grammar of HTML is not regular. If he were using Ruby, I still would have suggested Nokogiri or Hpricot, or Beautiful Soup for Python. It's best to treat HTML like HTML, not some arbitrary text stream with no grammar.

JasonTrue 2009-10-23 15:54:37

Answer 5

A:

This is very hard to get 100% right. HTML is so flexible that browsers will happily render markup on which a regex would fail; IE doesn't even need end tags. If you know you're going to have clean HTML, you should be fine.

Chad Grant 2009-04-25 00:54:41

Answer 6

+4 A:

The question is too broad to be answered definitively. Are you talking about removing all tags from a real-world HTML document, like a web page? If so, you would have to:

remove the <!DOCTYPE declaration or <?xml prolog if they exist
remove all SGML comments
remove the entire HEAD element
remove all SCRIPT and STYLE elements
do Grabthar-knows-what with FORM and TABLE elements
remove the remaining tags
remove the <![CDATA[ and ]]> sequences from CDATA sections but leave their contents alone

That's just off the top of my head--I'm sure there's more. Once you've done all that, you'll end up with words, sentences and paragraphs run together in some places, and big chunks of useless whitespace in others.
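
The steps listed above could be sketched roughly as follows. This is an illustration under several assumptions, not a robust implementation: all names are hypothetical, the steps are combined into one alternation so that text taken out of a CDATA section is not scanned again, FORM and TABLE get no special treatment, and the final whitespace collapse is an addition aimed at the "useless whitespace" problem.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PageTextExtractor
{
    // Alternatives are tried in order at each position; the generic tag pattern comes last.
    private static readonly Regex Markup = new Regex(
        @"<!--.*?-->" +                                          // comments
        @"|<!\[CDATA\[(?<cdata>.*?)\]\]>" +                      // CDATA: keep the contents
        @"|<!DOCTYPE[^>]*>" +                                    // DOCTYPE declaration
        @"|<\?xml.*?\?>" +                                       // XML prolog
        @"|<head\b.*?</head\s*>" +                               // the entire HEAD element
        @"|<(?<el>script|style)\b.*?</\k<el>\s*>" +              // SCRIPT and STYLE elements
        @"|(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>",     // any remaining tag
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string ExtractText(string html)
    {
        // Keep CDATA contents; everything else that matched is dropped.
        string text = Markup.Replace(html,
            m => m.Groups["cdata"].Success ? m.Groups["cdata"].Value : string.Empty);

        // Collapse the leftover runs of whitespace.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        string page =
            "<!DOCTYPE html><html><head><title>T</title></head><body>" +
            "<!-- note --><script>var a = 1 < 2;</script>" +
            "<p>Hello <b>world</b></p>\n<p><![CDATA[1 < 2]]></p></body></html>";
        Console.WriteLine(ExtractText(page)); // Hello world 1 < 2
    }
}
```

Because tags are replaced with nothing, adjacent blocks such as <p>a</p><p>b</p> still come out as "ab", which is the run-together text mentioned above.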

But, assuming you're working with just a fragment and you can get away with simply removing all tags, here's the regex I would use:

@"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>"

Matching single- and double-quoted strings in their own alternatives is sufficient to deal with the problem of angle brackets in attribute values. I don't see any need to explicitly match the attribute names and other stuff inside the tag, like the regex in Ryan's answer does; the first alternative handles all of that.
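
As a quick check of the claim about quoted attribute values, this sketch applies the regex above to tags whose attributes contain angle brackets. The class and method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedAttributeDemo
{
    // Pattern copied from the answer above.
    private static readonly Regex Tag = new Regex(
        @"(?></?\w+)(?>(?:[^>'""]+|'[^']*'|""[^""]*"")*)>");

    public static string Strip(string html)
    {
        return Tag.Replace(html, string.Empty);
    }

    public static void Main()
    {
        // A '>' inside a quoted value is consumed by the quoted-string alternatives.
        Console.WriteLine(Strip("<div id=\"x<4>\">text</div>"));         // text
        Console.WriteLine(Strip("<a title='a > b' href=\"#\">link</a>")); // link
    }
}
```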

In case you're wondering about those (?>...) constructs, they're atomic groups. They make the regex a little more efficient, but more importantly, they prevent runaway backtracking, which is something you should always watch out for when you mix alternation and nested quantifiers as I've done. I don't really think that would be a problem here, but I know if I don't mention it, someone else will. ;-)
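
For anyone unfamiliar with atomic groups, here is a small sketch of what "no backtracking into the group" means. The patterns are toy examples chosen for clarity, not taken from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtomicGroupDemo
{
    // Ordinary group: a+ first grabs "aaa", then gives one 'a' back so "ab" can match.
    public static bool MatchesWithBacktracking(string input)
    {
        return Regex.IsMatch(input, @"a+ab");
    }

    // Atomic group: once (?>a+) has matched, it never gives characters back,
    // so the following "ab" can no longer find its 'a'.
    public static bool MatchesWithAtomicGroup(string input)
    {
        return Regex.IsMatch(input, @"(?>a+)ab");
    }

    public static void Main()
    {
        Console.WriteLine(MatchesWithBacktracking("aaab")); // True
        Console.WriteLine(MatchesWithAtomicGroup("aaab"));  // False
    }
}
```

In the tag regex, the same property means that once the tag name or the run of attribute text has been matched, the engine does not go back and retry other ways of splitting it up when the final > is missing.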

This regex isn't perfect, of course, but it's probably as good as you'll ever need.

Alan Moore 2009-04-25 02:59:16

Answer 7

A:

Thanks, all, for your valuable input!

2009-04-25 06:03:16

Answer 8

+1 A:

(XElement.Parse(xmlString)).Value :)

http://csharplive.wordpress.com

Faraz Masood Khan 2009-11-09 05:02:20

This implies you must have an extremely valid xmlString, that is, well-formed XML.

BrunoLM 2010-10-06 12:39:32
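
A short sketch of the caveat raised here: XElement.Parse works when the input is well-formed XML and throws otherwise. The wrapper name TryGetText is hypothetical.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class XElementTextExtractor
{
    // Hypothetical wrapper: returns false instead of throwing on malformed input.
    public static bool TryGetText(string markup, out string text)
    {
        try
        {
            // Value concatenates the text of all descendant nodes.
            text = XElement.Parse(markup).Value;
            return true;
        }
        catch (XmlException)
        {
            text = string.Empty;
            return false;
        }
    }

    public static void Main()
    {
        string text;

        // Well-formed markup parses fine.
        Console.WriteLine(TryGetText("<p>Hello <b>world</b></p>", out text) + ": " + text); // True: Hello world

        // Typical HTML with an unclosed <br> is not well-formed XML.
        Console.WriteLine(TryGetText("<p>line one<br>line two</p>", out text));             // False
    }
}
```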
