Can anyone help me with a regular expression to strip multiline comments and single-line comments from a file?

e.g.:

                  The whole "/*...*/" has to be stripped off.

1.   /* comment */
2.   /* comment1 */  code   /* comment2 */ #both /*comment1*/ and /*comment2*/
                                             #have to be stripped off and the rest should
                                             #remain.
3.   /*.........
       .........
       .........
       ......... */

I would really appreciate any help with this. Thanks in advance.

+8  A: 

As is often the case in Perl, you can reach for CPAN: Regexp::Common::comment should help you. The only language I found that uses the comments you described is Nickle, but maybe PHP comments would be OK (there, // can also start a single-line comment).

Note that in any case, using regexps to strip out comments is dangerous; a full parser for the language is much less risky. A regexp-based parser, for example, is likely to get confused by something like print "/*";.
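To make the risk concrete, here is a minimal sketch of how a simple non-greedy regex goes wrong on a string literal containing /*. The helper name naive_strip and the sample input are made up for illustration; they are not from any module.

```perl
use strict;
use warnings;

# naive_strip is a hypothetical helper: it removes everything from a /*
# up to the next */, with no knowledge of string literals.
sub naive_strip {
    my ($src) = @_;
    $src =~ s{/\*.*?\*/}{}gs;
    return $src;
}

# The /* inside the quoted string is not a comment, but the regex cannot tell.
my $code = qq{print "/*"; x = 1; /* real comment */ y = 2;\n};
print naive_strip($code);   # prints: print " y = 2;
```

The match starts at the /* inside the string and runs to the end of the real comment, so the closing quote and the x = 1; statement are lost.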

mirod
A: 

Including tests:

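use strict;
use warnings;
use Test::More qw(no_plan);

# Remove every /* ... */ comment; /s lets . match newlines and .*? stops at the first */
sub strip_comments {
  my $string = shift;
  $string =~ s#/\*.*?\*/##sg; # strip multiline C comments
  return $string;
}

# two comments on one line
is(strip_comments('a/* comment1 */  code   /* comment2 */b'), 'a  code   b');
# a second /* inside a comment is swallowed up to the first */
is(strip_comments('a/* comment1 /* comment2 */b'), 'ab');
# a comment spanning several lines
is(strip_comments("a/* comment1\n\ncomment */ code /* comment2 */b"), 'a code b');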
Alexandr Ciornii
Great, thanks a lot.
lokesh
Will mess up /* or */ appearing in a string. E.g. the string "This /* string" does not include a comment start.
Richard
As well as not handling comment characters in strings (or even multi-character character constants), it also does not handle backslash-newline splicing which permits the opening slash to be followed by backslash, newline and then asterisk, for example. Also does not handle C++ comments (which can also have backslash-newline splicing). And it doesn't handle trigraphs - the only relevant one is '??/' which means backslash. How much this matters depends on how bullet-proof your code needs to be.
Jonathan Leffler
This is the wrong answer. Don't do it this way.
brian d foy
mirod's answer is much better.
Chris Huang-Leaver
A: 

Remove /* */ comments (including multi-line)

s/\/\*.*?\*\///gs

I post this because it is simple; however, I believe it will trip up on nested comments like

/* sdafsdfsdf /*sda asd*/ asdsdf */

But as they are fairly uncommon I prefer the simple regex.
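
As a quick check of that caveat, this small sketch runs the same substitution on the nested example. The variable names are illustrative only.

```perl
use strict;
use warnings;

# The nested example from above; .*? stops at the FIRST */ it finds.
my $nested = '/* sdafsdfsdf /*sda asd*/ asdsdf */';

# Copy the string, then strip comments from the copy.
(my $stripped = $nested) =~ s/\/\*.*?\*\///gs;

print "[$stripped]\n";   # prints: [ asdsdf */]
```

The tail of the outer comment is left behind, which is the failure mode described above. C itself does not nest comments, so a C compiler would see the same thing.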

gacrux
Great, works fine. Thanks a lot.
lokesh
Great, thank you.
lokesh
Read my answer to see why this is wrong.
brian d foy
+3  A: 

Isn't this a FAQ?

perldoc -q comment

Found in perlfaq6:

How do I use a regular expression to strip C style comments from a file? While this actually can be done, it's much harder than you'd think. For example, this one-liner ...

Sinan Ünür
You can link to perlfaqs at http://faq.perl.org (always the latest version), or perldoc.perl.org. That way those sites get good google juice for the people who search for answers. :)
brian d foy
@brian Thanks for catching that. I had meant to replace that with the link.
Sinan Ünür
+6  A: 

From perlfaq6 "How do I use a regular expression to strip C style comments from a file?":


While this actually can be done, it's much harder than you'd think. For example, this one-liner

perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c

will work in many but not all cases. You see, it's too simple-minded for certain kinds of C programs, in particular, those with what appear to be comments in quoted strings. For that, you'd need something like this, created by Jeffrey Friedl and later modified by Fred Curtis.

$/ = undef;
$_ = <>;
s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
print;
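
As a rough illustration of the difference, the sketch below wraps the same substitution in a function and feeds it a line where /* appears inside a string. The function name strip_c_comments and the sample input are assumptions made for this example; they are not part of perlfaq6.

```perl
use strict;
use warnings;

# strip_c_comments is a hypothetical wrapper around the Friedl/Curtis regex.
# Strings and ordinary text are captured in $2 and put back unchanged;
# comments match the first alternative, leave $2 undefined, and are dropped.
sub strip_c_comments {
    my ($src) = @_;
    $src =~ s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
    return $src;
}

my $code = qq{printf("/* not a comment */"); /* real comment */ x = 1;\n};
print strip_c_comments($code);   # prints: printf("/* not a comment */");  x = 1;
```

The string literal survives intact and only the real comment is removed. The spaces on either side of the comment both remain.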

This could, of course, be more legibly written with the /x modifier, adding whitespace and comments. Here it is expanded, courtesy of Fred Curtis.

s{
   /\*         ##  Start of /* ... */ comment
   [^*]*\*+    ##  Non-* followed by 1-or-more *'s
   (
     [^/*][^*]*\*+
   )*          ##  0-or-more things which don't start with /
               ##    but do end with '*'
   /           ##  End of /* ... */ comment

 |         ##     OR  various things which aren't comments:

   (
     "           ##  Start of " ... " string
     (
       \\.           ##  Escaped char
     |               ##    OR
       [^"\\]        ##  Non "\
     )*
     "           ##  End of " ... " string

   |         ##     OR

     '           ##  Start of ' ... ' string
     (
       \\.           ##  Escaped char
     |               ##    OR
       [^'\\]        ##  Non '\
     )*
     '           ##  End of ' ... ' string

   |         ##     OR

     .           ##  Any other char
     [^/"'\\]*   ##  Chars which don't start a comment, string or escape
   )
 }{defined $2 ? $2 : ""}gxse;

A slight modification also removes C++ comments, possibly spanning multiple lines using a continuation character:

 s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
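
A small sketch of how that variant behaves, again wrapped in a hypothetical function (strip_c_and_cpp_comments is a name made up for this example). Note that the pattern consumes the newline that ends a // comment, so the following line is joined onto the current one.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the C/C++ variant of the regex.
# Here the text to keep is in $3, because the // alternative adds a group.
sub strip_c_and_cpp_comments {
    my ($src) = @_;
    $src =~ s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//([^\\]|[^\n][\n]?)*?\n|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $3 ? $3 : ""#gse;
    return $src;
}

# Plain // comment: removed together with its terminating newline.
print strip_c_and_cpp_comments("x = 1; // note\ny = 2;\n");      # prints: x = 1; y = 2;

# // comment continued onto the next line with a trailing backslash.
print strip_c_and_cpp_comments("a; // one \\\ntwo\nb;\n");        # prints: a; b;
```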
brian d foy
brian, that functionality could almost be added to Perl; it seems to be asked about so often. At least IMO.
Paul Nathan
...and this is why we have tools like yacc, flex, bison, ANTLR, etc. This is something you need a full-blown parser for, not a regex.
Adam Rosenfield
@Paul: That functionality is already in Perl. Perl is a general purpose language. We don't want to add built-in features for every task that comes along. That's the job for modules.
brian d foy
A: 

There is also a non-Perl answer: use the program stripcmt:

StripCmt is a simple utility written in C to remove comments from C, C++, and Java source files. In the grand tradition of Unix text processing programs, it can function either as a FIFO (First In - First Out) filter or accept arguments on the command line.

hlovdal