ansaurus

Question

C# How to delete XML/HTML comments with regular expression

Answer 1

+4 A:

Please don't use regular expressions to work with markup languages - you need to use a better tool that is built for that kind of job.

Use the Html Agility Pack instead. I even found this article in which a reader (named Simon Mourier) comments with a function that uses the Html Agility Pack to remove comments from a document:

Simon Mourier said:

This is a sample code to remove comments:

static void Main(string[] args)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load("filewithcomments.htm");
    doc.Save(Console.Out); // show before
    RemoveComments(doc.DocumentNode);
    doc.Save(Console.Out); // show after
}

static void RemoveComments(HtmlNode node)
{
    if (node.NodeType == HtmlNodeType.Comment)
    {
        node.ParentNode.RemoveChild(node);
        return;
    }

    if (!node.HasChildNodes)
        return;

    foreach (HtmlNode subNode in node.ChildNodes)
    {
        RemoveComments(subNode);
    }
}

Andrew Hare 2009-08-20 05:09:44

I saw a similar comment of yours in another thread. I am not convinced that I need a better tool for occasional web scraping: extracting hrefs between a start and an end marker on an HTML page, some of them commented out.

MicMit 2009-08-20 06:10:19

Andrew is right. You cannot parse [X][HT]ML with regex, unless (a) you know in advance that a very restricted and fixed set of content is being used or (b) you don't mind lots of mistakes in your results. Parsing comments is less likely to break than parsing links, since there is much more variability in formatting for links, but it's still unreliable.

bobince 2009-08-20 09:29:25

The code sample doesn't work. You can't modify the nodes while enumerating over the collection.

Harry 2010-09-09 12:17:06
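
One common way around the problem Harry describes is to enumerate a snapshot of the child nodes rather than the live collection. The following is only a sketch: it assumes a recent Html Agility Pack release in which ChildNodes can be copied with LINQ's ToList(), and the class name CommentRemover and the helper Strip are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // NuGet package; a recent release is assumed

static class CommentRemover
{
    public static void RemoveComments(HtmlNode node)
    {
        if (node.NodeType == HtmlNodeType.Comment)
        {
            node.ParentNode.RemoveChild(node);
            return;
        }

        // ToList() copies the children first, so removing a child
        // does not change the list that is being enumerated.
        foreach (HtmlNode child in node.ChildNodes.ToList())
        {
            RemoveComments(child);
        }
    }

    // Hypothetical convenience wrapper: parse a string, strip comments, return the markup.
    public static string Strip(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        RemoveComments(doc.DocumentNode);
        return doc.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        Console.WriteLine(Strip("<div><!-- hidden --><a href=\"x.htm\">x</a></div>"));
    }
}
```

The recursion is the same as in the quoted sample; only the enumeration source changes.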

Answer 2

A:

This one works for me:

<!--(\n|.)*-->

But I think you could use a normal XML document for XML, or HtmlAgilityPack for HTML. I strongly recommend against parsing markup with regular expressions.
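
For the XML case, a minimal sketch using LINQ to XML might look like the following. XDocument is just one possible reading of "normal XML document"; the class name XmlCommentStripper and the sample input are illustrative, and the input has to be well-formed XML for XDocument.Parse to accept it.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class XmlCommentStripper
{
    public static string Strip(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // Select every comment node in the document and detach it.
        doc.DescendantNodes().OfType<XComment>().Remove();

        // DisableFormatting keeps the output free of added indentation.
        return doc.ToString(SaveOptions.DisableFormatting);
    }

    static void Main()
    {
        // Prints: <root><item>1</item></root>
        Console.WriteLine(Strip("<root><!-- note --><item>1</item><!-- other --></root>"));
    }
}
```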

Dmytrii Nagirniak 2009-08-20 05:11:06

You should put a non-greedy quantifier on your multiplier. Also, this problem can be solved by simply adding the Singleline flag, which modifies . to accept newlines too.

Matthew Scharley 2009-08-20 05:23:24

@Matthew. Yes, I agree. In theory you are correct. But I tried the Singleline flag and it doesn't change the result. Also, both non-greedy and greedy work. Tested using radsoftware.com.au/?from=RegexDesigner

Dmytrii Nagirniak 2009-08-20 06:14:17
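
Greedy and non-greedy give the same result when the input contains a single comment, which may be why both appeared to work in the test above. The difference shows up as soon as there are two comments with real content between them. The sketch below uses a made-up sample string and illustrative method names to show this.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyVersusLazy
{
    // Hypothetical input: two comments with a paragraph between them.
    const string Sample = "<p>one</p><!-- a --><p>two</p><!-- b\nc --><p>three</p>";

    // Greedy pattern from the answer: runs from the first "<!--" to the last "-->".
    public static string StripGreedy(string html) =>
        Regex.Replace(html, @"<!--(\n|.)*-->", "");

    // Non-greedy pattern with Singleline: stops at the nearest "-->".
    public static string StripLazy(string html) =>
        Regex.Replace(html, @"<!--.*?-->", "", RegexOptions.Singleline);

    static void Main()
    {
        Console.WriteLine(StripGreedy(Sample)); // <p>one</p><p>three</p>
        Console.WriteLine(StripLazy(Sample));   // <p>one</p><p>two</p><p>three</p>
    }
}
```

The greedy version silently drops the second paragraph because it sits between the two comments.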

Answer 3

A:

http://www.codeplex.com/htmlagilitypack

SUMMARY: It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

Have a look at - http://stackoverflow.com/questions/787932/using-c-regular-expressions-to-remove-html-tags

adatapost 2009-08-20 05:12:03

Answer 4

+5 A:

Change it to RegexOptions.Singleline instead and it'll work just fine. When not in Singleline mode, the dot matches any character except newline.

Note that Singleline and Multiline are not mutually exclusive. They do two separate things. To quote MSDN:

Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.

Single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).
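
As a rough illustration of the two options described above, the following sketch runs one pattern against a comment that spans two lines. The pattern, the input string and the names are assumptions chosen for the demonstration, not the asker's actual code.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglelineDemo
{
    const string Pattern = "<!--.*?-->";
    const string Input = "before<!-- spans\ntwo lines -->after";

    public static string Strip(string input, RegexOptions options) =>
        Regex.Replace(input, Pattern, "", options);

    static void Main()
    {
        // No options: '.' stops at '\n', so the comment is never matched.
        Console.WriteLine(Strip(Input, RegexOptions.None) == Input);      // True

        // Multiline only affects ^ and $, so nothing changes here either.
        Console.WriteLine(Strip(Input, RegexOptions.Multiline) == Input); // True

        // Singleline lets '.' match '\n', so the whole comment is removed.
        Console.WriteLine(Strip(Input, RegexOptions.Singleline));         // beforeafter

        // The two flags are independent and can be combined.
        Console.WriteLine(Strip(Input, RegexOptions.Singleline | RegexOptions.Multiline)); // beforeafter
    }
}
```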

Other people have already suggested the HTML Agility Pack. I just felt you should have an explanation on why your Regex wouldn't work :)

Thorarin 2009-08-20 05:17:44

+1 for answering the actual question.

womp 2009-08-20 05:24:48

Yes, it works. At first I didn't provide the third parameter and it didn't work. I thought RegexOptions.Singleline was implied, but it looks like Multiline is the default.

MicMit 2009-08-20 06:27:47

Singleline and Multiline are not opposites, no matter what the names seem to imply. Both options are off by default, and setting one has no effect on the other. Singleline changes the behavior of the dot metacharacter, and Multiline changes the behavior of the `^` and `$` anchors.

Alan Moore 2009-08-20 06:56:42

@Alan M: indeed, my answer was poorly worded in that respect. I've updated it a little.

Thorarin 2009-08-20 10:04:34
