I need to parse some C++ files, and to make things easier for myself, I thought about removing multiline comments first. I tried the following regex: /(\/\*.*?\*\/)/ with the multiline modifier, and it seems to work. Do you think there are any cases where it will fail?
The following is going to hurt you:
std::cout << "Printing some /* source code */" << std::endl;
This is a kind example. Imagine the damage if a string contained the start of a comment but never the end: you could end up deleting huge chunks of your code.
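To make the string-literal problem concrete, here is a minimal sketch using std::regex. The helper name strip_block_comments_naive is made up for illustration, and [\s\S]*? stands in for .*? with a dot-matches-newline modifier, because the default ECMAScript grammar's dot does not match newlines.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: applies the regex from the question to a whole source text.
// [\s\S]*? plays the role of .*? with a "dot matches newline" modifier.
std::string strip_block_comments_naive(const std::string& source) {
    static const std::regex block_comment(R"(/\*[\s\S]*?\*/)");
    return std::regex_replace(source, block_comment, "");
}
```

Fed the std::cout line above, this returns std::cout << "Printing some " << std::endl; so the string literal has been silently altered even though the file contained no comment at all.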
A regex may give you a good "quick-and-dirty" solution, and may work in your particular case (I urge you to perform one pass of "extract and print all matches" before you do a pass of "delete all matches" in order to make sure), but in the general case, you will need a much more sophisticated parser. You might be able to account for this situation with a regex, but it's going to get ugly.
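As a rough idea of what the first step towards such a parser might look like, here is a sketch of a small state machine that only removes /* ... */ comments and copies string and character literals through untouched. The function name strip_block_comments is hypothetical, and the sketch assumes the input contains no raw string literals, no // comments containing quote characters, no digit separators, no line splices and no trigraphs, which is exactly why the real thing gets ugly.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: removes /* ... */ comments but leaves string and
// character literals alone. Raw strings, // comments, digit separators,
// line splices and trigraphs are NOT handled in this sketch.
std::string strip_block_comments(const std::string& src) {
    enum class State { Code, Literal, Comment };
    State state = State::Code;
    char quote = '\0';  // which quote character opened the current literal
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        switch (state) {
        case State::Code:
            if (c == '/' && next == '*') {
                state = State::Comment;
                ++i;  // consume the '*'
            } else {
                if (c == '"' || c == '\'') {
                    state = State::Literal;
                    quote = c;
                }
                out += c;
            }
            break;
        case State::Literal:
            out += c;
            if (c == '\\' && i + 1 < src.size()) {
                out += next;  // copy the escaped character verbatim
                ++i;
            } else if (c == quote) {
                state = State::Code;
            }
            break;
        case State::Comment:
            if (c == '*' && next == '/') {
                state = State::Code;
                out += ' ';  // the compiler treats a comment as one space
                ++i;         // consume the '/'
            }
            break;
        }
    }
    return out;
}
```

With this version the std::cout line above comes back unchanged, while a/* x */b becomes a b. Every case listed as unhandled would need another state or another pass.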
EDIT: Thanks to @MSalters in the comments, I've realized that the problem you have involves a bit more than just the source files, though strictly speaking if you use macros with embedded comments you're asking for trouble. So after a bit of testing, it turns out there is already a tool installed on most machines with a C++ compiler that will weed out comments, and handle all the finicky string and macro issues for you. Use this on file.cpp to get the output without comments (single- or multi-line):
cpp file.cpp
Sure, that will expand all the macros and #includes, and may not have the same nice neat formatting you wanted, but it will easily deal with all macro, string, and other issues associated with comment finding. If you don't know, cpp is the C preprocessor as a standalone executable (theoretically you can use #includes and #defines and such in any language with a relatively C-like syntax), so if you don't have it, you can get the same effect with GCC like this:
gcc -E file.cpp
(Change gcc to g++ if you really care - it may handle #include <iostream> better.)
Removing comments is, as far as I know, not strictly a part of the preprocessor, but most preprocessors do it in that stage to simplify the syntax of the actual language parser (well, GCC's preprocessor does, and that's all I have to test with). So if your compiler's preprocessor option will do this for you, and this is all you want done, stop rolling your own right now.
I apologize for not thinking of this sooner. I don't know how it escaped me.
Another example where it will fail:
//****
some code
/*
comments
*/
In this case it will match everything except the first slash.
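The match described above can be checked with a short sketch. The helper first_block_comment_match is a made-up name, and [\s\S]*? is again an assumption standing in for .*? with a dot-matches-newline modifier.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns {offset, length} of the first match of the
// question's regex, or {std::string::npos, 0} when nothing matches.
std::pair<std::size_t, std::size_t> first_block_comment_match(const std::string& source) {
    static const std::regex block_comment(R"(/\*[\s\S]*?\*/)");
    std::smatch m;
    if (!std::regex_search(source, m, block_comment)) {
        return {std::string::npos, 0};
    }
    return {static_cast<std::size_t>(m.position(0)),
            static_cast<std::size_t>(m.length(0))};
}
```

For the five-line snippet above, the match starts at offset 1 (the second slash of the // line) and runs to the final */, so "some code" is inside the span that would be deleted.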
A regex can't do this. It just can't. Without seeing what you've written, it's hard to say whether it handles all the corner cases correctly, but my immediate guess is: "Probably not."
For a few examples, consider
// This is a single line\
comment
Line-splicing still happens inside of comments. Remember, also, that the backslash that continues the line could be created from a trigraph:
// This is also a single-line??/
comment
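To illustrate why both snippets are a single comment, here is a sketch of the two early steps the compiler performs before it ever looks for comments: trigraph replacement, then line splicing. Both function names are hypothetical, and only the one trigraph that produces a backslash is handled; the others are ignored for brevity.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces only the trigraph that stands for a backslash
// (question mark, question mark, slash). All other trigraphs are ignored here.
std::string replace_backslash_trigraph(const std::string& source) {
    std::string out;
    for (std::size_t i = 0; i < source.size(); ++i) {
        if (i + 2 < source.size() && source[i] == '?' &&
            source[i + 1] == '?' && source[i + 2] == '/') {
            out += '\\';
            i += 2;  // skip the remaining two characters of the trigraph
        } else {
            out += source[i];
        }
    }
    return out;
}

// Hypothetical helper: deletes every backslash that is immediately followed
// by a newline, joining the two physical lines into one logical line.
std::string splice_lines(const std::string& source) {
    std::string out;
    for (std::size_t i = 0; i < source.size(); ++i) {
        if (source[i] == '\\' && i + 1 < source.size() && source[i + 1] == '\n') {
            ++i;  // drop both the backslash and the newline
        } else {
            out += source[i];
        }
    }
    return out;
}
```

Running splice_lines(replace_backslash_trigraph(text)) on either example leaves one physical line that starts with //, so the word "comment" on the second line is part of the comment. A comment remover that works line by line on the raw file would get this wrong.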
You also have to guard against trying to parse preprocessor directives, or you could run into trouble. For example, this is probably intended to include all the headers in a specified directory:
#include <all_headers/*>
But if you handle it incorrectly, you're going to delete everything up to the end of the next comment...
Of course, to keep things interesting, that could also be created from a trigraph or digraph:
%:include <all_headers/*>
or even a combination of digraphs and trigraphs:
%:include<all_headers??/*>
Which, after you resolve the trigraph, doesn't contain a comment delimiter at all, being equivalent to:
#include <all_headers\*>
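As a sketch of the kind of guard this implies, the function below recognises an include directive in all three spellings of the leading # before any comment matching is attempted. The name looks_like_include is hypothetical, and the check is deliberately crude: it assumes one directive per physical line and does not handle line splices, comments between # and include, or identifiers that merely begin with "include".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true when the line, after leading blanks, starts with
// '#', the digraph "%:", or the trigraph for '#', followed by "include".
bool looks_like_include(const std::string& line) {
    std::size_t i = line.find_first_not_of(" \t");
    if (i == std::string::npos) return false;
    if (line[i] == '#') {
        i += 1;
    } else if (line.compare(i, 2, "%:") == 0) {
        i += 2;
    } else if (line.compare(i, 3, "?\?=") == 0) {  // '?' escaped so this file contains no trigraph
        i += 3;
    } else {
        return false;
    }
    i = line.find_first_not_of(" \t", i);  // blanks are allowed after the '#'
    return i != std::string::npos && line.compare(i, 7, "include") == 0;
}
```

A comment stripper could use a check like this to leave the header name alone instead of treating /* inside it as the start of a comment. Other directives can legitimately contain comments, so skipping every directive line wholesale would not be right either.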