Hello,

I'm parsing a source code file, and I want to remove all line comments (i.e. those starting with "//") and multi-line comments (i.e. /*...*/). However, if a multi-line comment has at least one line break in it (\n), I want the output to have exactly one line break in its place.

For example, the code:

qwe /* 123
456 
789 */ asd

should turn exactly into:

qwe
asd

and not "qweasd" or:

qwe

asd

What would be the best way to do so? Thanks


EDIT: Example code for testing:

comments_test = "hello // comment\n"+\
                "line 2 /* a comment */\n"+\
                "line 3 /* a comment*/ /*comment*/\n"+\
                "line 4 /* a comment\n"+\
                "continuation of a comment*/ line 5\n"+\
                "/* comment */line 6\n"+\
                "line 7 /*********\n"+\
                "********************\n"+\
                "**************/\n"+\
                "line ?? /*********\n"+\
                "********************\n"+\
                "********************\n"+\
                "********************\n"+\
                "********************\n"+\
                "**************/\n"+\
                "line ??"

Expected results:

hello 
line 2 
line 3  
line 4
line 5
line 6
line 7
line ??
line ??
+1  A: 

Is this what you're looking for?

>>> print(s)
qwe /* 123
456
789 */ asd
>>> print(re.sub(r'\s*/\*.*\n.*\*/\s*', '\n', s, flags=re.S))
qwe
asd

This will work only for those comments that are more than one line, but will leave others alone.

sykora
Thanks. I actually also need to remove single-line comments of a multi-line form (e.g. "/*comment*/"). I can do it with a separate regex, but can you add this to yours?
Roee Adler
I think it would be simpler with a separate regex, such as r'/\*.*\*/' because of the re.S flag (see http://docs.python.org/library/re.html#re.S) and the fact that different replacements make sense ('\n' vs. '').
Matthew Flaschen
Also, I believe sykora's regex should have \s* rather than \s+
Matthew Flaschen
Also, I'd worry about the greediness of `.*`. I would nearly always use `.*?`. For instance, if there were two single-line comments on the same line, the greediness could wipe out everything in between.
Joseph Pecoraro
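A minimal sketch of the greediness concern raised in the comment above. The input line is made up for illustration, and the variable names are not from the thread:

```python
import re

# Hypothetical input: two block comments on the same line.
line = "a /* one */ b /* two */ c"

# Greedy: .* runs to the LAST "*/", so "b" is swallowed as well.
greedy = re.sub(r'/\*.*\*/', '', line)

# Lazy: .*? stops at the FIRST "*/", so each comment is removed separately.
lazy = re.sub(r'/\*.*?\*/', '', line)

print(repr(greedy))  # 'a  c'
print(repr(lazy))    # 'a  b  c'
```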
You're right, it should be \s* and not \s+. And matching comments of the same start/end delimiters but spanning only one line would be better accomplished using a separate pattern, there's nothing really accomplished by trying to mash it into the same one, and the replacements would be tricky.
sykora
I also agree with Joseph regarding .*? here rather than .* (in both instances).
Matthew Flaschen
Another problem with this is that it will always match until the next line break, even if that would be outside the comment.
MizardX
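A small sketch of the problem described in the comment above, assuming the pattern from the answer with `.*` already swapped for `.*?`. The input is invented: two unrelated one-line comments on consecutive lines.

```python
import re

# Hypothetical input: each line holds its own complete comment.
source = "a /* c */ b\nd /* e */ f"

# The pattern requires a "\n" between "/*" and "*/", so it keeps scanning
# past the end of the first comment until it has crossed a line break.
pattern = re.compile(r'\s*/\*.*?\n.*?\*/\s*', re.S)
result = pattern.sub('\n', source)

# "b" and "d" are code, not comments, yet they fall inside the match.
print(repr(result))  # 'a\nf'
```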
+1  A: 

How about this:

re.sub(r'\s*/\*(.|\n)*?\*/\s*', '\n', s, flags=re.DOTALL).strip()

It matches leading whitespace, /*, any text and newlines up to the first */, then any whitespace after that.

It's a small twist on sykora's example, but it is also non-greedy on the inside. You might also want to look into the MULTILINE option.

Joseph Pecoraro
I believe that results in multi-line comments taking up exactly one line being changed into a blank line, while Rax wants those to disappear.
Matthew Flaschen
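To illustrate the objection in the comment above, here is a short sketch with an invented one-line input. It only shows the behaviour of the pattern from this answer; it is not a proposed fix.

```python
import re

# Hypothetical input: a block comment that does not span a line break.
source = "x = 1; /* note */ y = 2;"

# Every comment is replaced by "\n", whether or not it contained one,
# so the one-line comment splits the statement across two lines.
result = re.sub(r'\s*/\*(.|\n)*?\*/\s*', '\n', source)

print(repr(result))  # 'x = 1;\ny = 2;'
```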
A: 

See can-regular-expressions-be-used-to-match-nested-patterns - if you consider nested comments, regular expressions are not the solution.

gimel
First, I think he really means regex, not regular expressions. Second, for a simple application in which perfection is not required (out of millions of lines of source code, how many have nested /* */ comments), regex is a workable solution that's simpler than a real push-down automaton.
Matthew Flaschen
Every language I know that uses /* */ comments does so in a non-nested fashion: the first /* comments out everything up to the first */. However, you do raise a valid point: basic regexes cannot handle balancing or nesting because they don't have enough memory. Fortunately, this is not one of those cases.
Joseph Pecoraro
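A quick sketch of the non-nested behaviour described in the comment above, on a made-up input that tries to nest comments. A lazy pattern mirrors that rule by ending the comment at the first `*/`:

```python
import re

# Hypothetical input with an attempted nested comment.
source = "x /* outer /* inner */ tail */ y"

# The match ends at the first "*/", just as a non-nesting language would
# treat it; " tail */" is left behind as ordinary text.
result = re.sub(r'/\*.*?\*/', '', source, flags=re.S)

print(repr(result))  # 'x  tail */ y'
```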
@Matthew, what is regex if it's not a regular expression?
paxdiablo
+4  A: 
import re

comment_re = re.compile(
    r'(^)?[^\S\n]*/(?:\*(.*?)\*/[^\S\n]*|/[^\n]*)($)?',
    re.DOTALL | re.MULTILINE
)

def comment_replacer(match):
    start,mid,end = match.group(1,2,3)
    if mid is None:
        # single line comment
        return ''
    elif start is not None or end is not None:
        # multi line comment at start or end of a line
        return ''
    elif '\n' in mid:
        # multi line comment with line break
        return '\n'
    else:
        # multi line comment without line break
        return ' '

def remove_comments(text):
    return comment_re.sub(comment_replacer, text)
  • (^)? will match if the comment starts at the beginning of a line, as long as the MULTILINE-flag is used.
  • [^\S\n] will match any whitespace character except newline. We don't want to match line breaks if the comment starts on its own line.
  • /\*(.*?)\*/ will match a multi-line comment and capture the content. Lazy matching, so we don't match two or more comments. DOTALL-flag makes . match newlines.
  • //[^\n]* will match a single-line comment. Can't use . because of the DOTALL-flag.
  • ($)? will match if the comment stops at the end of a line, as long as the MULTILINE-flag is used.
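Two of the building blocks listed above can be tried in isolation. This is only an illustrative sketch; the test strings and the names `horizontal_ws` and `probe` are made up here and are not part of the answer's code.

```python
import re

# [^\S\n] reads as "not (non-whitespace or newline)",
# i.e. any whitespace character other than "\n".
horizontal_ws = re.compile(r'[^\S\n]+')
runs = horizontal_ws.findall("a \t\nb  c")
print(runs)  # [' \t', '  ']

# An optional group around ^ records whether the match began at the start
# of a line: the group is '' when ^ matched and None when it was skipped.
probe = re.compile(r'(^)?x', re.MULTILINE)
starts = [m.group(1) is not None for m in probe.finditer("x ax\nx")]
print(starts)  # [True, False, True]
```

The same trick with `($)?` tells the replacement function whether the comment ran up to the end of a line.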

Examples:

>>> s = ("qwe /* 123\n"
...      "456\n"
...      "789 */ asd /* 123 */ zxc\n"
...      "rty // fgh\n")
>>> print('"' + '"\n"'.join(
...     remove_comments(s).splitlines()
... ) + '"')
"qwe"
"asd zxc"
"rty"
>>> comments_test = ("hello // comment\n"
...                  "line 2 /* a comment */\n"
...                  "line 3 /* a comment*/ /*comment*/\n"
...                  "line 4 /* a comment\n"
...                  "continuation of a comment*/ line 5\n"
...                  "/* comment */line 6\n"
...                  "line 7 /*********\n"
...                  "********************\n"
...                  "**************/\n"
...                  "line ?? /*********\n"
...                  "********************\n"
...                  "********************\n"
...                  "********************\n"
...                  "********************\n"
...                  "**************/\n")
>>> print('"' + '"\n"'.join(
...     remove_comments(comments_test).splitlines()
... ) + '"')
"hello"
"line 2"
"line 3 "
"line 4"
"line 5"
"line 6"
"line 7"
"line ??"
"line ??"

Edits:

  • Updated to new specification.
  • Added another example.
MizardX
I used this one, and since it worked fine, I did not try the rest. So apologies for the rest of the people that answered correctly.
Roee Adler
@MizardX: I would appreciate if you see my edit (to the question) and clarifications, thanks.
Roee Adler
+3  A: 

The fact that you have to even ask this question, and that the solutions given are, shall we say, less than perfectly readable :-) should be a good indication that REs are not the real answer to this question.

You would be far better, from a readability viewpoint, to actually code this up as a relatively simple parser.

Too often, people try to use REs to be "clever" (I don't mean that in a disparaging way), thinking that a single line is elegant, but all they end up with is an unmaintainable morass of characters. I'd rather have a fully commented 20-line solution that I can understand in an instant.

paxdiablo
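For comparison, here is a rough sketch of what such a simple hand-written scanner might look like in Python. It is not code from this thread. The function name `strip_comments` is hypothetical, and it assumes that "//" and "/*" never appear inside string literals, that an unterminated comment runs to the end of the input, and that spaces or tabs around a comment containing a line break should be dropped.

```python
def strip_comments(text):
    """Remove // and /* */ comments (sketch; string literals are ignored)."""
    out = []                      # characters kept so far
    i, n = 0, len(text)
    while i < n:
        if text.startswith("//", i):
            # Line comment: skip up to, but not including, the newline.
            newline = text.find("\n", i)
            i = n if newline == -1 else newline
        elif text.startswith("/*", i):
            # Block comment: ends at the first "*/" (no nesting).
            close = text.find("*/", i + 2)
            end = n if close == -1 else close + 2
            if "\n" in text[i:end]:
                # Drop spaces and tabs on both sides of the comment.
                while out and out[-1] in " \t":
                    out.pop()
                while end < n and text[end] in " \t":
                    end += 1
                # Emit one newline unless a newline (or the end of the
                # input) follows anyway, which would give a blank line.
                if end < n and text[end] != "\n":
                    out.append("\n")
            i = end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(strip_comments("qwe /* 123\n456 \n789 */ asd"))
```

On the question's first example this prints "qwe" and "asd" on two lines. Handling string literals would need one more state in the loop.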
@Pax: The reason I aimed at a regular expression is that I thought it would be more efficient. I have millions of lines of code to analyze, and I'm trying to eliminate performance bottlenecks. Currently I have "readable" code doing the work, and I thought I could beef up performance by moving to regex. Do you disagree with this logic? Thanks.
Roee Adler
RE's are never more efficient than a well-written parser *in compiled languages*. That's because you can use domain knowledge when you write the parser (more speed) but an RE engine has to be able to handle everything. In the case of Python (unless it has JIT), an RE will probably be faster since the RE engine will be machine language whereas an interpreted parser will be, well, interpreted. I still prefer readability over speed though. Compute time (running code) is a lot cheaper than person time (maintaining code). So no, I don't disagree, but you need to be aware of what you're sacrificing.
paxdiablo