I've been trying to come up with a regex that lets me search for a particular string while automatically skipping comments. Does anyone have an RE like this, or know of one? It doesn't need to be sophisticated enough to skip #if 0 blocks; I just want it to skip over // comments and /* ... */ blocks. The converse, searching only inside comment blocks, would be very useful too.

Environment: VS 2003

Language: C++

+3  A: 

This is a harder problem than it might at first appear, since you need to consider comment tokens inside strings, comment tokens that are themselves commented out, and so on.

I wrote a string and comment parser for C#, let me see if I can dig out something that will help... I'll update if I find anything.

EDIT: ... ok, so I found my old 'codemasker' project. It turns out that I did this in stages, not with a single regex. Basically I inch through a source file looking for start tokens; when I find one, I look for the matching end token and mask everything in between. This takes the context of the start token into account: if you find a "string start" token, you can safely ignore comment tokens until you find the end of the string, and vice versa. Once the code is masked (I used GUIDs as masks, and a hashtable to keep track of them), you can safely do your search and replace, then finally restore the masked code.
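To make the staged idea concrete, here is a minimal sketch of that kind of masker. It is written in Python rather than C#, and it is not the original codemasker code: the names `mask_source` and `unmask` are made up for illustration, and it assumes C/C++-style `//`, `/* */`, `"..."` and `'...'` tokens. It deliberately ignores line continuations, trigraphs and `#if 0` blocks.

```python
import uuid


def mask_source(src):
    """Replace comments and string/char literals with unique placeholders.

    Returns (masked_text, table), where table maps placeholder -> original.
    Hypothetical helper; a sketch, not a full C++ lexer.
    """
    table = {}
    out = []
    i, n = 0, len(src)
    while i < n:
        two = src[i:i + 2]
        if two == "//":
            # Line comment: runs up to (not including) the newline.
            end = src.find("\n", i)
            end = n if end == -1 else end
        elif two == "/*":
            # Block comment: runs through the closing */ (or to end of file).
            close = src.find("*/", i + 2)
            end = n if close == -1 else close + 2
        elif src[i] in "\"'":
            # String or char literal: skip escaped characters in pairs,
            # so comment tokens inside the literal are never seen.
            quote = src[i]
            end = i + 1
            while end < n and src[end] != quote:
                end += 2 if src[end] == "\\" else 1
            end = min(end + 1, n)
        else:
            out.append(src[i])
            i += 1
            continue
        key = "__MASK_" + uuid.uuid4().hex + "__"
        table[key] = src[i:end]
        out.append(key)
        i = end
    return "".join(out), table


def unmask(text, table):
    """Put the original comments and literals back."""
    for key, original in table.items():
        text = text.replace(key, original)
    return text
```

Usage would be along the lines of `masked, table = mask_source(src)`, then a plain search or replace on `masked`, then `unmask(result, table)`. If you only need to search, keeping the table also gives you the converse for free: the comment entries in `table` are exactly the text you skipped.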

Hope that helps.

Ed Guiness
+2  A: 

Be especially careful with strings. Strings often have escape sequences which you also have to respect while you're finding the end of them.

So e.g. "This is \"a test\"". You cannot blindly look for a double-quote to terminate. Also beware of "This is \\", where the final backslash is itself escaped and the quote really does end the string. That shows you cannot just say "unless the double-quote is preceded by a backslash."

In summary, make some brutal unit tests!
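As a starting point for such tests, here is a small sketch in Python. The "careful" pattern is the same string alternative that appears in the Perl regexes elsewhere in this thread (`"(\\.|[^"\\])*"`); the two naive patterns and the helper name `first_match` are only there for illustration.

```python
import re

# Consume either an escaped character (backslash + anything) or any
# character that is neither a quote nor a backslash.
CAREFUL = re.compile(r'"(?:\\.|[^"\\])*"')

# Naive: stop at the first double-quote, escaped or not.
NAIVE_FIRST_QUOTE = re.compile(r'"[^"]*"')

# Naive: stop at the first double-quote not preceded by a backslash.
NAIVE_LOOKBEHIND = re.compile(r'".*?(?<!\\)"')


def first_match(pattern, code):
    """Return the first string literal the pattern finds, or None."""
    m = pattern.search(code)
    return m.group(0) if m else None


escaped_quotes = r'x = "This is \"a test\"";'
escaped_backslash = r'x = "This is \\", y = "z";'

# The careful pattern handles both cases.
assert first_match(CAREFUL, escaped_quotes) == r'"This is \"a test\""'
assert first_match(CAREFUL, escaped_backslash) == r'"This is \\"'

# The first naive pattern stops too early on an escaped quote.
assert first_match(NAIVE_FIRST_QUOTE, escaped_quotes) == r'"This is \"'

# The lookbehind pattern runs past the real end of the string.
assert first_match(NAIVE_LOOKBEHIND, escaped_backslash) == r'"This is \\", y = "'
```

Each naive pattern passes one of the two cases and fails the other, which is why both belong in the test suite.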

Jason Cohen
+1  A: 

I would make a copy and strip out the comments first, then search the string the regular way.
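A rough sketch of that idea in Python, assuming C/C++-style comments. Instead of deleting the comments outright it overwrites them with spaces, so line numbers in the copy still match the original file. The names `TOKEN`, `view` and `find_lines` are invented for this example, and it does not handle line continuations or `#if 0` blocks.

```python
import re

# Comments and literals are matched in one alternation so that a comment
# token inside a string is consumed as part of the string.
TOKEN = re.compile(
    r'//[^\n]*'              # line comment
    r'|/\*.*?\*/'            # block comment
    r'|"(?:\\.|[^"\\])*"'    # string literal
    r"|'(?:\\.|[^'\\])*'",   # character literal
    re.DOTALL,
)


def _blank(text):
    """Overwrite everything except newlines with spaces."""
    return re.sub(r"[^\n]", " ", text)


def view(src, comments_only=False):
    """Return a same-length copy with comments blanked out,
    or (comments_only=True) with everything except comments blanked out."""
    out, pos = [], 0
    for m in TOKEN.finditer(src):
        code = src[pos:m.start()]
        tok = m.group(0)
        is_comment = tok.startswith("/")
        out.append(_blank(code) if comments_only else code)
        out.append(tok if is_comment == comments_only else _blank(tok))
        pos = m.end()
    tail = src[pos:]
    out.append(_blank(tail) if comments_only else tail)
    return "".join(out)


def find_lines(src, needle, comments_only=False):
    """1-based line numbers where needle occurs in the chosen view."""
    lines = view(src, comments_only).splitlines()
    return [n for n, line in enumerate(lines, 1) if needle in line]
```

With `comments_only=True` the same function covers the converse case from the question, searching only inside comments.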

asksol
+2  A: 

A regexp is not the best tool for the job.

Perl FAQ:

C comments:

#!/usr/bin/perl
$/ = undef;
$_ = <>; 

s#/\*[^*]*\*+([^/*][^*]*\*+)*/|([^/"']*("[^"\\]*(\\[\d\D][^"\\]*)*"[^/"']*|'[^'\\]*(\\[\d\D][^'\\]*)*'[^/"']*|/+[^*/][^/"']*)*)#$2#g;
print;

C++ comments (note that this one rewrites each // comment as a /* */ comment rather than removing it):

#!/usr/local/bin/perl
$/ = undef;
$_ = <>;

s#//(.*)|/\*[^*]*\*+([^/*][^*]*\*+)*/|"(\\.|[^"\\])*"|'(\\.|[^'\\])*'|[^/"']+#  $1 ? "/*$1 */" : $& #ge;
print;
J.F. Sebastian