I need to parse some C++ files, and to make things easier I thought about removing multiline comments first. I tried the regex /(\/\*.*?\*\/)/ with the multiline modifier, and it seems to work. Are there any cases where it will fail?

+8  A: 

The following is going to hurt you:

std::cout << "Printing some /* source code */" << std::endl;

This is a kind example. Imagine the damage if a string opened a comment and never closed it: you could end up deleting huge chunks of your code.

A regex may give you a good "quick-and-dirty" solution, and may work in your particular case (I urge you to perform one pass of "extract and print all matches" before you do a pass of "delete all matches" in order to make sure), but in the general case, you will need a much more sophisticated parser. You might be able to account for this situation with a regex, but it's going to get ugly.
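As a rough sketch of that "extract and print all matches" pass, here is one way it could look with C++11 `std::regex`. The function name `find_comment_matches` is hypothetical, and the pattern uses `[\s\S]` in place of `.` because `.` does not match newlines in the default ECMAScript grammar. This is only an illustration of the dry run, not a safe comment stripper.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every span the regex would delete, so the
// matches can be inspected before anything is actually removed.
std::vector<std::string> find_comment_matches(const std::string& source) {
    // Non-greedy match from "/*" to the nearest "*/", newlines included.
    static const std::regex block_comment(R"(/\*[\s\S]*?\*/)");
    std::vector<std::string> matches;
    for (std::sregex_iterator it(source.begin(), source.end(), block_comment), end;
         it != end; ++it) {
        matches.push_back(it->str());
    }
    return matches;
}
```

Printing each returned string before deleting anything makes the string-literal problem visible: for the `std::cout` line above, `/* source code */` shows up in the list even though it is not a comment.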

EDIT: Thanks to @MSalters in the comments, I've realized that the problem you have involves a bit more than just the source files, though strictly speaking if you use macros with embedded comments you're asking for trouble. So after a bit of testing, it turns out there is already a tool installed on most machines with a C++ compiler that will weed out comments, and handle all the finicky string and macro issues for you. Use this on file.cpp to get the output without comments (single- or multi-line):

cpp file.cpp

Sure, that will expand all the macros and #includes, and may not have the same nice neat formatting you wanted, but it will easily deal with all macro, string, and other issues associated with comment finding. If you don't know, cpp is the C preprocessor as a standalone executable (theoretically you can use #includes and #defines and such in any language with a relatively C-like syntax), so if you don't have it, you can get the same effect with GCC like this:

gcc -E file.cpp

(Change gcc to g++ if you really care - it may handle #include <iostream> better.)

Removing comments is, as far as I know, not strictly a part of the preprocessor, but most preprocessors do it in that stage to simplify the syntax of the actual language parser (well, GCC's preprocessor does, and that's all I have to test with). So if your compiler's preprocessor option will do this for you, and this is all you want done, stop rolling your own right now.

I apologize for not thinking of this sooner. I don't know how it escaped me.

Chris Lutz
I was doing an extraction pass first, and then a replace one. After seeing your examples, I think I'm better off writing a small parser chunk to handle this. Shouldn't be too complicated.
Geo
It's not unreasonable to at least try the extraction test pass and see what kind of data turns up. But a parser will be a much better solution in the long run. If you design it as a filter that takes C++ code and spits out commentless C++ code, all it really needs to recognize are strings and comments.
Chris Lutz
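A filter of the kind described in the comment above, one that only recognizes strings and comments, might be sketched as a small state machine. The name `strip_comments` is hypothetical, and the sketch deliberately ignores raw string literals, digit separators such as 1'000, trigraphs, backslash-newline splices, and header names in #include lines, so it is a starting point rather than a complete solution.

```cpp
#include <cstddef>
#include <string>

// Hypothetical filter: copies `source` to the result, dropping // and /* */
// comments while leaving string and character literals untouched.
// Each block comment is replaced by a single space so tokens stay separated.
std::string strip_comments(const std::string& source) {
    enum class State { Code, String, Char, LineComment, BlockComment };
    State state = State::Code;
    std::string out;
    for (std::size_t i = 0; i < source.size(); ++i) {
        const char c = source[i];
        const char next = (i + 1 < source.size()) ? source[i + 1] : '\0';
        switch (state) {
        case State::Code:
            if (c == '/' && next == '/') { state = State::LineComment; ++i; }
            else if (c == '/' && next == '*') { state = State::BlockComment; ++i; }
            else {
                if (c == '"') state = State::String;
                else if (c == '\'') state = State::Char;
                out += c;
            }
            break;
        case State::String:
        case State::Char:
            out += c;
            if (c == '\\' && i + 1 < source.size()) {
                out += next;  // keep the escaped character, e.g. \" or \'
                ++i;
            } else if (c == (state == State::String ? '"' : '\'')) {
                state = State::Code;  // closing quote of the literal
            }
            break;
        case State::LineComment:
            if (c == '\n') { out += c; state = State::Code; }
            break;
        case State::BlockComment:
            if (c == '*' && next == '/') { out += ' '; state = State::Code; ++i; }
            break;
        }
    }
    return out;
}
```

Because the scan tracks whether it is inside a literal, a `/*` inside a string is copied through unchanged, and a `/*` inside a `//` comment is discarded along with the rest of that line.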
From my tests, all the data turned out good. But who knows how the sources will turn up in the future.
Geo
@Chris Lutz: Nope. You need to parse include files, too. And since macros can produce strings, you need to process those too. For both DEBUG and NDEBUG builds __in parallel__ !
MSalters
@MSalters - Fortunately we don't have to write our own preprocessor, because there are plenty of ways to get the compiler to do the preprocessing for us, which is far easier. Of course, `cpp`, `gcc -E -`, and `g++ -E -` all delete comments as part of the preprocessing step anyway, so maybe instead of writing our own parser, all the OP really needs is to inspect the code spit out by the preprocessor.
Chris Lutz
+5  A: 

Another example where it'll fail:

//**** 
some code
/*
  comments
*/

In this case it will match everything except the first slash, because the `/*` inside `//****` is taken as the start of a multiline comment.

Nick D
It might be reasonable to assume that we can (somewhat safely) strip out single-line comments before parsing multi-line comments (though even that seemingly easier task falls prey to the string trap), in which case this might come out okay. But +1 for a good catch.
Chris Lutz
+3  A: 

A regex can't do this. It just can't. Without seeing what you've written, it's hard to say whether it handles all the corner cases correctly, but my immediate guess is: "Probably not."

For a few examples, consider

// This is a single line\
comment

Line-splicing still happens inside of comments. Remember, also, that the backslash that continues the line could be created from a trigraph:

// This is also a single-line??/
comment
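To illustrate why these two cases matter, here is a minimal sketch of a pre-pass that would have to run before any comment detection. The name `splice_lines` is hypothetical; it handles only the ??/ trigraph (the other eight are ignored) and assumes a pre-C++17 dialect, since trigraphs were removed in C++17.

```cpp
#include <cstddef>
#include <string>

// Hypothetical pre-pass: first turns the backslash trigraph into a real
// backslash, then deletes every backslash-newline pair so that spliced
// lines become one logical line.
std::string splice_lines(const std::string& source) {
    std::string phase1;
    for (std::size_t i = 0; i < source.size(); ++i) {
        // "?\?/" spells the three trigraph characters without letting the
        // compiler treat this literal itself as a trigraph.
        if (source.compare(i, 3, "?\?/") == 0) { phase1 += '\\'; i += 2; }
        else phase1 += source[i];
    }
    std::string phase2;
    for (std::size_t i = 0; i < phase1.size(); ++i) {
        if (phase1[i] == '\\' && i + 1 < phase1.size() && phase1[i + 1] == '\n') {
            ++i;  // drop both the backslash and the newline
        } else {
            phase2 += phase1[i];
        }
    }
    return phase2;
}
```

After this pass, both examples above collapse to a single physical line, so a later single-line-comment rule would correctly swallow the word "comment" as well.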

You also have to ensure against trying to parse preprocessor statements, or you could run into trouble. For example, this is probably intended to include all the headers in a specified directory:

#include <all_headers/*>

But if you handle it incorrectly, you're going to delete everything to the end of the next comment...

Of course, to keep things interesting, that could also be created from a trigraph or digraph:

%:include <all_headers/*>

or even a combination of digraphs and trigraphs:

%:include<all_headers??/*>

Which, after you resolve the trigraph, doesn't contain a comment delimiter at all, being equivalent to:

#include <all_headers\*>
Jerry Coffin