ansaurus

Question

Answer 1

+2 A:

Before you implement this, you will need to create test cases for it:

1. Simple comments: /* */, //, ///
2. Multi-line comments: /* This\nis\na\ntest*/
3. Comments after a line of code: var a = "apple"; // test or /* test */
4. Comments within comments: /* This // is a test */, or // This /* is a test */
5. Simple non-comments that look like comments and appear in quotes: var comment = "/* This is a test*/", or var url = "http://stackoverflow.com";
6. Complex non-comments that look like comments: var abc = @" this /* \n is a comment in quote\n*/", with or without spaces between " and /* or */ and "

There are probably more cases out there.

Once you have all of them, then you can create a parsing rule for each of them, or group some of them.

Solving this with regular expressions alone will probably be hard and error-prone, and the result will be difficult for you and other programmers to test and maintain.
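One way to read "a parsing rule for each case" is a small hand-written scanner that walks the source once and applies one rule per token kind. The following is only a minimal sketch of that idea, not code from this answer: the names CommentScanner and Strip are hypothetical, and it assumes the input is C#-style source with the two comment forms and the two string forms listed above. Character literals such as '"' are not handled.

```csharp
using System;
using System.Text;

public static class CommentScanner
{
    // Hypothetical helper: copies code and string literals, drops comments.
    public static string Strip(string src)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < src.Length)
        {
            char c = src[i];
            char next = i + 1 < src.Length ? src[i + 1] : '\0';

            if (c == '/' && next == '*')
            {
                // Block comment: skip to the closing */ (or to the end if unterminated).
                int end = src.IndexOf("*/", i + 2, StringComparison.Ordinal);
                i = end < 0 ? src.Length : end + 2;
            }
            else if (c == '/' && next == '/')
            {
                // Line comment: skip up to, but not including, the line break.
                while (i < src.Length && src[i] != '\r' && src[i] != '\n') i++;
            }
            else if (c == '@' && next == '"')
            {
                // Verbatim string: "" is an escaped quote; the literal may span lines.
                int j = i + 2;
                while (j < src.Length)
                {
                    if (src[j] == '"')
                    {
                        if (j + 1 < src.Length && src[j + 1] == '"') { j += 2; continue; }
                        j++; // consume the closing quote
                        break;
                    }
                    j++;
                }
                sb.Append(src, i, j - i);
                i = j;
            }
            else if (c == '"')
            {
                // Regular string: a backslash escapes the next character.
                int j = i + 1;
                while (j < src.Length && src[j] != '"' && src[j] != '\n')
                    j += (src[j] == '\\' && j + 1 < src.Length) ? 2 : 1;
                if (j < src.Length && src[j] == '"') j++; // include the closing quote
                sb.Append(src, i, j - i);
                i = j;
            }
            else
            {
                sb.Append(c);
                i++;
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Strip("var a = \"apple\"; // test"));
        Console.WriteLine(Strip("var url = \"http://stackoverflow.com\";"));
        Console.WriteLine(Strip("var comment = \"/* This is a test*/\"; /* real comment */"));
    }
}
```

Because each branch handles exactly one of the listed cases, every test case maps to one rule, which is the main argument for this style. The cost is more code than a single regular expression.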

Holystream 2010-08-19 17:40:53

Holystream, I do have some of the test cases you mentioned, but not all. My sample above covers 1 (partially), 2, 3, and 4. 5 and 6 are good points which I had not considered.

Welton v3.50 2010-08-19 17:50:41

Holystream, I believe you are making it out to be harder than it is. Matching the two comment styles is really easy with regular expressions — in fact, the C# (and C++) lexer probably does that. This is in contrast to something like HTML, which is hard to match with regexes because HTML tags can nest and because they come in too many different varieties.

Timwi 2010-08-19 17:58:07

@Timwi: Actually, .NET uses a lexical analyzer. The comment symbols are just tokens. http://en.wikipedia.org/wiki/Lexical_analysis

chilltemp 2010-08-19 18:03:53

@Timwi: Can you please give me an example that works with the cases above? I am very interested to see a regular expression that passes those test cases. /\*(.*?)\*/|//.*?\r?\n failed a lot of them.

Holystream 2010-08-19 18:17:53

@Holystream: Have you tried the regex in my answer? You seem to have removed two backslashes from it. If my regex fails, please provide a specific example in which it fails, and comment on my answer instead of this one. Thanks!

Timwi 2010-08-19 20:36:50

@chilltemp: That is what I said. “lexer” is short for “lexical analyzer”.

Timwi 2010-08-19 20:38:24

@Timwi: Thanks for the edited example. I would comment on your post, but I don't have enough reputation points yet :) It seems to be working better, though it still failed on multi-line comments such as /* Line 1 * Line 2 * Line 3*/ or on var url = "http://stackoverflow.com"; // Stackoverflow website.

Holystream 2010-08-19 21:14:03

@Holystream: I tried both examples and they work fine for me. [Here is the full code I’ve used for you to play with.](http://csharp.pastebin.com/0aqBdFE5)

Timwi 2010-08-19 21:37:21

@Timwi: +1. Thanks, very educational.

Holystream 2010-08-19 22:52:14

Answer 2

+4 A:

Both of your regular expressions (for block and line comments) have bugs. If you want I can describe the bugs, but I felt it’s perhaps more productive if I write new ones, especially because I’m intending to write a single one that matches both.

The thing is, every time you have /* and // and literal strings “interfering” with each other, it is always the one that starts first that takes precedence. That’s very convenient because that’s exactly how regular expressions work: find the first match first.

So let’s define a regular expression that matches each of those four tokens:

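// Block comment: /* ... */ (non-greedy; with RegexOptions.Singleline, . also matches newlines)
var blockComments = @"/\*(.*?)\*/";
// Line comment: // up to and including the line break
var lineComments = @"//(.*?)\r?\n";
// Regular string literal: backslash escapes allowed, no raw newlines
var strings = @"""((\\[^\n]|[^""\n])*)""";
// Verbatim string literal: @"...", where "" is an escaped quote
var verbatimStrings = @"@(""[^""]*"")+";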

To answer the question in the title (strip comments), we need to:

Replace the block comments with nothing
Replace the line comments with a newline (because the regex eats the newline)
Keep the literal strings where they are.

Regex.Replace can do this easily using a MatchEvaluator function:

string noComments = Regex.Replace(input,
    blockComments + "|" + lineComments + "|" + strings + "|" + verbatimStrings,
    me => {
        if (me.Value.StartsWith("/*") || me.Value.StartsWith("//"))
            return me.Value.StartsWith("//") ? Environment.NewLine : "";
        // Keep the literal strings
        return me.Value;
    },
    RegexOptions.Singleline);

I ran this code on all the examples that Holystream provided and various other cases that I could think of, and it works like a charm. If you can provide an example where it fails, I am happy to adjust the code for you.
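For anyone who wants to try this without assembling the pieces by hand, here is a sketch that wraps the four patterns and the Regex.Replace call from this answer in a small program. The class name RegexCommentStripper, the method name Strip, and the sample inputs are assumptions added for illustration; the patterns themselves are copied unchanged from above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCommentStripper
{
    // Patterns copied from the answer above.
    const string BlockComments = @"/\*(.*?)\*/";
    const string LineComments = @"//(.*?)\r?\n";
    const string Strings = @"""((\\[^\n]|[^""\n])*)""";
    const string VerbatimStrings = @"@(""[^""]*"")+";

    // Hypothetical wrapper: returns the input with comments removed.
    public static string Strip(string input)
    {
        return Regex.Replace(input,
            BlockComments + "|" + LineComments + "|" + Strings + "|" + VerbatimStrings,
            me =>
            {
                // A line comment swallowed its line break, so put one back.
                if (me.Value.StartsWith("//", StringComparison.Ordinal))
                    return Environment.NewLine;
                // A block comment is replaced with nothing.
                if (me.Value.StartsWith("/*", StringComparison.Ordinal))
                    return "";
                // Anything else is a string literal: keep it unchanged.
                return me.Value;
            },
            RegexOptions.Singleline);
    }

    public static void Main()
    {
        string[] samples =
        {
            "var a = \"apple\"; // test\n",
            "var url = \"http://stackoverflow.com\"; // Stackoverflow website\n",
            "/* Line 1\n * Line 2\n * Line 3\n */\nint x;",
            "var abc = @\" this /* \n is a comment in quote\n*/\";",
        };
        foreach (var s in samples)
            Console.WriteLine("[" + Strip(s) + "]");
    }
}
```

One thing to keep in mind when writing test inputs: the line-comment pattern ends in \r?\n, so a // comment on the very last line of a file with no trailing line break is left in place. Appending a newline to the input before the call is one simple workaround.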

Timwi 2010-08-19 17:53:11

@Timwi: I do not need to extract the comments, just strip them out of my source script. I tried your code, and it worked well. Ideally, I'd like to remove any line completely, if the line only contained comments. e.g. no blank lines left where a comment was. However, this is not a requirement, just a formatting preference. Thanks.

Welton v3.50 2010-08-20 16:57:14

@Welton: Well, you could just run `Regex.Replace(noComments, @"^(\s*\r?\n){2,}", Environment.NewLine, RegexOptions.Multiline)` on the result afterwards, but this will also remove blank double-lines that *didn’t* have a comment in them.
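As a rough illustration of what that cleanup pass does, the sketch below wraps the call in a hypothetical helper. The names BlankLineCollapser and Collapse are made up here, and the newLine parameter is an assumption added so the result does not depend on the platform's Environment.NewLine. As far as I can tell from the pattern, a run of two or more blank lines is collapsed into one, while a single blank line is left alone, because the {2,} quantifier needs at least two line breaks to match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlankLineCollapser
{
    // Hypothetical wrapper around the call suggested above.
    // newLine is the replacement text, e.g. Environment.NewLine.
    public static string Collapse(string text, string newLine)
    {
        return Regex.Replace(text, @"^(\s*\r?\n){2,}", newLine, RegexOptions.Multiline);
    }

    public static void Main()
    {
        // Two blank lines between the statements, as might remain after stripping.
        string stripped = "int a;\n\n\nint b;\n";
        Console.Write(Collapse(stripped, "\n"));
    }
}
```

So a comment that sat alone between two lines of code still leaves one blank line behind after this pass. Removing that line as well would need a different approach, for example handling comment-only lines inside the MatchEvaluator.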

Timwi 2010-08-20 17:14:00

Regex to strip line comments from C#...
