I'm working on a routine to strip block or line comments from some C# code. I have looked at the other examples on the site, but haven't found the exact answer that I'm looking for.

I can match block comments (/* comment */) in their entirety using this regular expression with RegexOptions.Singleline:

(/\*[\w\W]*\*/)

And I can match line comments (// comment) in their entirety using this regular expression with RegexOptions.Multiline:

(//((?!\*/).)*)(?!\*/)[^\r\n]

Note: I'm using [^\r\n] instead of $ because $ is including \r in the match, too.

However, this doesn't quite work the way I want it to.

Here is my test code that I'm matching against:

// remove whole line comments
bool broken = false; // remove partial line comments
if (broken == true)
{
    return "BROKEN";
}
/* remove block comments
else
{
    return "FIXED";
} // do not remove nested comments */ bool working = !broken;
return "NO COMMENT";

The block expression matches

/* remove block comments
else
{
    return "FIXED";
} // do not remove nested comments */

which is fine and good, but the line expression matches

// remove whole line comments
// remove partial line comments

and

// do not remove nested comments

Also, if I do not have the */ negative lookahead in the line expression twice, it matches

// do not remove nested comments *

which I really don't want.

What I want is an expression that will match characters, starting with //, to the end of line, but does not contain */ between the // and end of line.

Also, just to satisfy my curiosity, can anyone explain why I need the lookahead twice? (//((?!\*/).)*)[^\r\n] and (//(.)*)(?!\*/)[^\r\n] will both include the *, but (//((?!\*/).)*)(?!\*/)[^\r\n] and (//((?!\*/).)*(?!\*/))[^\r\n] won't.

+2  A: 

Before you implement this, create test cases for it:

  1. Simple comments /* */, //, ///
  2. Multi line comments /* This\nis\na\ntest*/
  3. Comments after line of code var a = "apple"; // test or /* test */
  4. Comments within comments /* This // is a test */, or // This /* is a test */
  5. Simple non-comments that look like comments and appear in quotes: var comment = "/* This is a test*/", or var url = "http://stackoverflow.com";
  6. Complex non-comments that look like comments: var abc = @" this /* \n is a comment in quote\n*/", with or without spaces between " and /* or */ and "

There are probably more cases out there.
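
One way to pin these cases down is a small table of inputs. The following is only a sketch: the names `CommentTestCases` and `Cases` are hypothetical, the `HasComment` flag is an assumption about what a correct stripper should do with each input, and the tuple syntax assumes C# 7 or later.

```csharp
using System;
using System.Collections.Generic;

static class CommentTestCases
{
    // Hypothetical table: each entry pairs a source snippet with whether
    // a correct stripper should change it (true) or leave it alone (false).
    public static readonly List<(string Description, string Input, bool HasComment)> Cases =
        new List<(string, string, bool)>
        {
            ("simple block comment",      "/* test */", true),
            ("simple line comment",       "// test\n", true),
            ("doc comment",               "/// test\n", true),
            ("multi-line block comment",  "/* This\nis\na\ntest*/", true),
            ("comment after code",        "var a = \"apple\"; // test\n", true),
            ("line comment inside block", "/* This // is a test */", true),
            ("block opener inside line",  "// This /* is a test */\n", true),
            ("block comment in a string", "var comment = \"/* This is a test*/\";", false),
            ("URL in a string",           "var url = \"http://stackoverflow.com\";", false),
            ("verbatim string",           "var abc = @\" this /* \n is a comment in quote\n*/\";", false),
        };
}
```

A test loop can then run the stripper over every `Input` and check that entries with `HasComment == false` come back unchanged.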

Once you have all of them, then you can create a parsing rule for each of them, or group some of them.

Solving this with regular expressions alone will probably be hard and error-prone, and the result will be hard to test and hard for you and other programmers to maintain.

Holystream
Holystream, I do have some of the test cases you mentioned, but not all. My sample above covers 1 (partially), 2, 3, and 4. 5 and 6 are good points which I had not considered.
Welton v3.50
Holystream, I believe you are making it out to be harder than it is. Matching the two comment styles is really easy with regular expressions — in fact, the C# (and C++) lexer probably does that. This is in contrast to something like HTML, which is hard to match with regexes because HTML tags can nest and because they come in too many different varieties.
Timwi
@Timwi: Actually, .NET uses a lexical analyzer. The comment symbols are just tokens. http://en.wikipedia.org/wiki/Lexical_analysis
chilltemp
@Timwi: Can you please give me an example that works with the cases above? I am very interested to see a regular expression that passes those test cases. /\*(.*?)\*/|//.*?\r?\n failed a lot of those test cases.
Holystream
@Holystream: Have you tried the regex in my answer? You seem to have removed two backslashes from it. If my regex fails, please provide a specific example in which it fails, and comment on my answer instead of this one. Thanks!
Timwi
@chilltemp: That is what I said. “lexer” is short for “lexical analyzer”.
Timwi
@Timwi: Thanks for the edited example. I would comment on your post, but I don't have enough reputation points yet :) It seems to be working better, though it still failed on multi-line comments such as /* Line 1 * Line 2 * Line 3 */ or var url = "http://stackoverflow.com"; // Stackoverflow website.
Holystream
@Holystream: I tried both examples and they work fine for me. [Here is the full code I’ve used for you to play with.](http://csharp.pastebin.com/0aqBdFE5)
Timwi
@Timwi: +1. Thanks, very educational.
Holystream
+4  A: 

Both of your regular expressions (for block and line comments) have bugs. If you want I can describe the bugs, but I felt it’s perhaps more productive if I write new ones, especially because I’m intending to write a single one that matches both.

The thing is, every time you have /* and // and literal strings “interfering” with each other, it is always the one that starts first that takes precedence. That’s very convenient because that’s exactly how regular expressions work: find the first match first.

So let’s define a regular expression that matches each of those four tokens:

var blockComments = @"/\*(.*?)\*/";
var lineComments = @"//(.*?)\r?\n";
var strings = @"""((\\[^\n]|[^""\n])*)""";
var verbatimStrings = @"@(""[^""]*"")+";
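
To see the "first match wins" behaviour in isolation, the four patterns can be joined and the matches listed in order. This is a minimal sketch, not part of the original answer; the class name `TokenDemo` and the method `Tokens` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TokenDemo
{
    // The four patterns from the answer, unchanged.
    const string BlockComments = @"/\*(.*?)\*/";
    const string LineComments = @"//(.*?)\r?\n";
    const string Strings = @"""((\\[^\n]|[^""\n])*)""";
    const string VerbatimStrings = @"@(""[^""]*"")+";

    // Returns every token the combined pattern finds, in source order.
    public static List<string> Tokens(string input)
    {
        var combined = BlockComments + "|" + LineComments + "|" + Strings + "|" + VerbatimStrings;
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, combined, RegexOptions.Singleline))
            result.Add(m.Value);
        return result;
    }
}
```

For the input `var url = "http://stackoverflow.com"; // Stackoverflow website.` followed by a newline, `Tokens` returns two entries: the whole string literal first, then the line comment. The `//` inside the URL is never seen as a comment because the string literal starts earlier and consumes it.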

To answer the question in the title (strip comments), we need to:

  • Replace the block comments with nothing
  • Replace the line comments with a newline (because the regex eats the newline)
  • Keep the literal strings where they are.

Regex.Replace can do this easily using a MatchEvaluator function:

string noComments = Regex.Replace(input,
    blockComments + "|" + lineComments + "|" + strings + "|" + verbatimStrings,
    me => {
        if (me.Value.StartsWith("/*") || me.Value.StartsWith("//"))
            return me.Value.StartsWith("//") ? Environment.NewLine : "";
        // Keep the literal strings
        return me.Value;
    },
    RegexOptions.Singleline);

I ran this code on all the examples that Holystream provided and various other cases that I could think of, and it works like a charm. If you can provide an example where it fails, I am happy to adjust the code for you.
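
One case worth checking separately is a line comment on the very last line of the input with no trailing line break: the line-comment pattern above requires a `\n`, so it will not match there. The sketch below is a variant under that assumption, not the code from this answer. It swaps in `//[^\r\n]*`, which stops before the line break instead of consuming it, so comments can simply be replaced with nothing. The class name `CommentStripper` and method `Strip` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommentStripper
{
    // Block comments and both string patterns are taken from the answer.
    const string BlockComments = @"/\*(.*?)\*/";
    // Assumption: this variant stops before the line break instead of consuming it.
    const string LineComments = @"//[^\r\n]*";
    const string Strings = @"""((\\[^\n]|[^""\n])*)""";
    const string VerbatimStrings = @"@(""[^""]*"")+";

    public static string Strip(string input)
    {
        return Regex.Replace(input,
            BlockComments + "|" + LineComments + "|" + Strings + "|" + VerbatimStrings,
            m => (m.Value.StartsWith("/*", StringComparison.Ordinal) ||
                  m.Value.StartsWith("//", StringComparison.Ordinal))
                ? ""        // drop comments; the line break was never matched
                : m.Value,  // keep string literals untouched
            RegexOptions.Singleline);
    }
}
```

With this variant, `CommentStripper.Strip("int x = 1; // trailing")` yields `int x = 1; ` (the space before the comment is kept), and string literals such as `"http://stackoverflow.com"` pass through unchanged.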

Timwi
@Timwi: I do not need to extract the comments, just strip them out of my source script. I tried your code, and it worked well. Ideally, I'd like to remove any line completely, if the line only contained comments. e.g. no blank lines left where a comment was. However, this is not a requirement, just a formatting preference. Thanks.
Welton v3.50
@Welton: Well, you could just run `Regex.Replace(noComments, @"^(\s*\r?\n){2,}", Environment.NewLine, RegexOptions.Multiline)` on the result afterwards, but this will also remove blank double-lines that *didn’t* have a comment in them.
Timwi