I've encountered the need to remove comments of the form:

<!--  Foo

      Bar  -->

I'd like to use a regular expression that matches anything (including line breaks) between the beginning and end 'delimiters.'

What would a good regex be for this task?

+5  A: 

The simple way:

Regex xmlCommentsRegex = new Regex("<!--.*?-->", RegexOptions.Singleline | RegexOptions.Compiled);

And a better way:

Regex xmlCommentsRegex = new Regex("<!--(?:[^-]|-(?!->))*-->", RegexOptions.Singleline | RegexOptions.Compiled);
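
As a minimal sketch of how the second pattern could be applied, the snippet below strips comments with Regex.Replace. The class name CommentStripper and the sample input are assumptions made for illustration; only the pattern and the options come from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripper
{
    // After "<!--", consume any character that is not '-', or a '-' that
    // does not begin "-->", then require the closing "-->".
    private static readonly Regex XmlCommentsRegex = new Regex(
        "<!--(?:[^-]|-(?!->))*-->",
        RegexOptions.Singleline | RegexOptions.Compiled);

    public static string Strip(string input)
    {
        // Replace every match with the empty string.
        return XmlCommentsRegex.Replace(input, string.Empty);
    }

    public static void Main()
    {
        // Hypothetical input: a comment that spans several lines.
        string xml = "<root><!--  Foo\n\n      Bar  --><item>kept</item></root>";
        Console.WriteLine(Strip(xml)); // prints <root><item>kept</item></root>
    }
}
```

Because [^-] is a negated character class, it matches line breaks regardless of RegexOptions.Singleline; that option only matters for the `.` in the simpler pattern.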
Diadistis
For my simple test case, <!--(?:[^-]|-(?!->))*--> is equivalent to my own: <!--([\s\S]*?)--> Is mine missing something?
Charlie Salts
The only difference is performance. According to my tests, yours takes 118 steps to complete while mine takes 62 :)
Diadistis
I don't know about .NET's regex library, but many regex compilers have optimizations for .*? that make it much faster than the naive case.
ʞɔıu
+5  A: 

None. XML is not a regular language, so it cannot be described by the regular grammar that regular expressions are based upon.

Suppose this thread is exported as XML. Your example (<!-- Foo Bar -->), if enclosed in a CDATA section, would be stripped by the regex even though it is literal text there, not a comment.
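
The CDATA case can be reproduced with a short sketch. The class name CdataPitfall and the sample document are assumptions for illustration; the pattern is the simple one from the earlier answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CdataPitfall
{
    public static string StripWithRegex(string input)
    {
        // The regex has no notion of CDATA sections, so it treats
        // anything that looks like a comment as a comment.
        return Regex.Replace(input, "<!--.*?-->", string.Empty, RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Hypothetical input: the CDATA section holds literal text that
        // merely looks like a comment.
        string xml = "<post><![CDATA[Remove <!-- Foo Bar --> like this]]></post>";
        Console.WriteLine(StripWithRegex(xml));
        // prints <post><![CDATA[Remove  like this]]></post>
    }
}
```

The text inside the CDATA section was part of the document's content, yet it is gone from the output.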

yogman
+4  A: 

The 'proper' way would be to use XSLT and copy everything but comments.
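
A rough sketch of that idea in .NET, assuming XslCompiledTransform is available: an identity template copies every node, and an empty template for comment() drops comments. The class name XsltCommentStripper and the sample input are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltCommentStripper
{
    // Identity transform plus an empty template for comments. The explicit
    // priority makes the comment template win over the identity template.
    private const string Stylesheet = @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:template match=""@*|node()"">
    <xsl:copy>
      <xsl:apply-templates select=""@*|node()""/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match=""comment()"" priority=""1""/>
</xsl:stylesheet>";

    public static string Strip(string xml)
    {
        var transform = new XslCompiledTransform();
        using (var stylesheetReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(stylesheetReader);
        }

        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(output, settings))
        {
            transform.Transform(input, writer);
        }
        return output.ToString();
    }

    public static void Main()
    {
        // Hypothetical input with one multi-line comment.
        string xml = "<root><!--  Foo\n\n      Bar  --><item>kept</item></root>";
        Console.WriteLine(Strip(xml));
    }
}
```

Since the input goes through a real XML parser first, text inside a CDATA section is treated as text and is not removed, although the serializer may write it back as escaped text rather than as a CDATA section.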

Chris Nava
I don't have much experience with XSLT, but that's something I might try in the future.
Charlie Salts
A: 

Parsing XML with regex is considered bad style. Use an XML parsing library instead.
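
One possible way to do this in .NET is LINQ to XML; this is only a sketch, and the class name ParserCommentStripper and the sample input are assumptions. It parses the document, removes every XComment node, and serializes the result.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class ParserCommentStripper
{
    public static string Strip(string xml)
    {
        XDocument doc = XDocument.Parse(xml, LoadOptions.PreserveWhitespace);

        // Extensions.Remove() snapshots the sequence before deleting, so it
        // is safe to call on a lazy query over the same tree.
        doc.DescendantNodes().OfType<XComment>().Remove();

        return doc.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        // Hypothetical input: one real comment, plus comment-like text
        // inside a CDATA section that should be left alone.
        string xml = "<root><!--  Foo\n\n      Bar  -->"
                   + "<item><![CDATA[<!-- not a comment -->]]></item></root>";
        Console.WriteLine(Strip(xml));
    }
}
```

The parser distinguishes real comments from comment-like text in CDATA, so only the former are removed.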