ansaurus

Question

Python regex question: stripping multi-line comments but maintaining a line break

Answer 1

+1 A:

Is this what you're looking for?

>>> import re
>>> s = "qwe /* 123\n456\n789 */ asd"
>>> print(s)
qwe /* 123
456
789 */ asd
>>> print(re.sub(r'\s*/\*.*\n.*\*/\s*', '\n', s, flags=re.S))
qwe
asd

This works only for comments that span more than one line; it leaves all other comments alone.

sykora 2009-05-10 04:30:47

Thanks. I actually also need to remove single-line comments of a multi-line form (e.g. "/*comment*/"). I can do it with a separate regex, but can you add this to yours?

Roee Adler 2009-05-10 04:34:35

I think it would be simpler with a separate regex, such as r'/\*.*\*/' because of the re.S flag (see http://docs.python.org/library/re.html#re.S) and the fact that different replacements make sense ('\n' vs. '').

Matthew Flaschen 2009-05-10 04:52:56

Also, I believe sykora's regex should have \s* rather than \s+

Matthew Flaschen 2009-05-10 04:53:49

Also, I'd worry about the greediness of `.*`. I would nearly always use `.*?`. For instance, if there were two single-line comments on the same line, the greedy match could wipe out everything in between.

Joseph Pecoraro 2009-05-10 04:59:40
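
A small sketch of the greediness concern raised above. The sample line is made up for illustration; the two patterns differ only in `.*` versus `.*?`.

```python
import re

# Two one-line /* */ comments on the same line (hypothetical sample text).
line = "a = 1; /* first */ b = 2; /* second */ c = 3;"

# Greedy: ".*" runs to the LAST "*/" on the line, so "b = 2;" is swallowed.
greedy = re.sub(r'/\*.*\*/', '', line)

# Lazy: ".*?" stops at the FIRST "*/", so each comment is removed separately.
lazy = re.sub(r'/\*.*?\*/', '', line)

print(repr(greedy))
print(repr(lazy))
```

The code between the two comments survives only with the lazy quantifier.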

You're right, it should be \s* and not \s+. Comments that use the same start/end delimiters but span only one line are better handled with a separate pattern; nothing is gained by mashing both cases into one, and the replacements would be tricky.

sykora 2009-05-10 04:59:43
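
The two-pattern idea from these comments might look roughly like the sketch below. The function name `strip_block_comments` and the sample string are assumptions for illustration, and the sketch assumes that no /* ... */ pair sits inside a longer multi-line comment.

```python
import re

def strip_block_comments(text):
    # Pass 1: comments that open and close on the same line simply vanish.
    # Without re.DOTALL, "." cannot cross a line break, so multi-line
    # comments are left untouched by this pass.
    text = re.sub(r'/\*.*?\*/', '', text)
    # Pass 2: any comment still present spans several lines; replace it
    # (plus the surrounding whitespace) with a single line break.
    return re.sub(r'\s*/\*.*?\*/\s*', '\n', text, flags=re.DOTALL)

sample = "qwe /* 123\n456\n789 */ asd /* x */ zxc"
result = strip_block_comments(sample)
print(result)
```

Keeping the passes separate lets each one use its own replacement ('' versus '\n'), which is the point made above.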

I also agree with Joseph regarding .*? here rather than .* (in both instances).

Matthew Flaschen 2009-05-10 05:05:22

Another problem with this is that it will always match until the next line break, even if that would be outside the comment.

MizardX 2009-05-10 05:11:48
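
To illustrate this last point: because the pattern requires a `\n` after the opening `/*`, a comment that closes on its own line still drags the match across the line break and into the next comment. The sample text below is made up; the lazy variant is included only to show that `.*?` alone does not cure this.

```python
import re

# Two one-line comments on consecutive lines (hypothetical sample text).
s = "a /* one */ b\nc /* two */ d"

greedy = re.sub(r'\s*/\*.*\n.*\*/\s*', '\n', s, flags=re.S)
lazy = re.sub(r'\s*/\*.*?\n.*?\*/\s*', '\n', s, flags=re.S)

# In both cases the match runs from the first "/*" through the line
# break to the "*/" on the next line, so "b" and "c" are lost.
print(repr(greedy))
print(repr(lazy))
```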

Answer 2

+1 A:

How about this:

re.sub(r'\s*/\*(.|\n)*?\*/\s*', '\n', s, flags=re.DOTALL).strip()

It matches leading whitespace, /*, any text and newlines up until the first */, then any whitespace after that.

It's a little twist on sykora's example, but it is also non-greedy on the inside. You might also want to look into the re.MULTILINE option.

Joseph Pecoraro 2009-05-10 04:56:47

I believe that results in multi-line comments taking up exactly one line being changed into a blank line, while Rax wants those to disappear.

Matthew Flaschen 2009-05-10 05:15:37
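
A quick check of how this answer's pattern treats a comment that fits on one line. The sample text is made up, and the flag is omitted in this sketch because `(.|\n)` already matches newlines on its own.

```python
import re

pattern = r'\s*/\*(.|\n)*?\*/\s*'

# A comment that opens and closes on the same line (hypothetical sample).
one_line = "x = 1; /* note */ y = 2;"
result = re.sub(pattern, '\n', one_line).strip()

# The comment does not simply disappear: it is replaced by a line break,
# so the original line is split in two.
print(repr(result))
```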

Answer 3

A:

See can-regular-expressions-be-used-to-match-nested-patterns - if you consider nested comments, regular expressions are not the solution.

gimel 2009-05-10 04:57:04

First, I think he really means regex, not regular expressions. Second, for a simple application in which perfection is not required (out of millions of lines of source code, how many have nested /* */ comments), regex is a workable solution that's simpler than a real push-down automaton.

Matthew Flaschen 2009-05-10 05:03:21

Every language I know that uses /* */ comments treats them as non-nested: the first /* comments out everything up to the first */. However, you do raise a valid point: basic regexes cannot handle balancing/nesting because they don't have enough memory. Fortunately, this is not one of those cases.

Joseph Pecoraro 2009-05-10 05:04:25
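
A minimal sketch of the non-nested behaviour described above, using a lazy pattern on made-up text that looks nested:

```python
import re

# Text that *looks* nested (hypothetical sample).
text = "a /* outer /* inner */ still outer */ b"

# The lazy match ends at the first "*/", just as a non-nesting
# language would end the comment there.
stripped = re.sub(r'/\*.*?\*/', '', text, flags=re.DOTALL)

print(repr(stripped))
```

The leftover "still outer */" shows that the comment ended at the first closing delimiter, which is the same rule a non-nesting language applies.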

@Matthew, what is regex if it's not a regular expression?

paxdiablo 2009-05-10 05:08:22

Answer 4

+4 A:

import re

comment_re = re.compile(
    r'(^)?[^\S\n]*/(?:\*(.*?)\*/[^\S\n]*|/[^\n]*)($)?',
    re.DOTALL | re.MULTILINE
)

def comment_replacer(match):
    start, mid, end = match.group(1, 2, 3)
    if mid is None:
        # single line comment
        return ''
    elif start is not None or end is not None:
        # multi line comment at start or end of a line
        return ''
    elif '\n' in mid:
        # multi line comment with line break
        return '\n'
    else:
        # multi line comment without line break
        return ' '

def remove_comments(text):
    return comment_re.sub(comment_replacer, text)

(^)? will match if the comment starts at the beginning of a line, as long as the MULTILINE-flag is used.
[^\S\n] will match any whitespace character except newline. We don't want to match line breaks if the comment starts on its own line.
/\*(.*?)\*/ will match a multi-line comment and capture the content. Lazy matching, so we don't match two or more comments. DOTALL-flag makes . match newlines.
//[^\n]* will match a single-line comment. Can't use . because of the DOTALL-flag.
($)? will match if the comment stops at the end of a line, as long as the MULTILINE-flag is used.
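
Two of the building blocks listed above can be tried in isolation. This is only an illustrative sketch; the names `horizontal_ws` and `probe` and the sample strings are made up.

```python
import re

# [^\S\n] is "not non-whitespace and not newline": whitespace minus "\n".
horizontal_ws = re.compile(r'[^\S\n]+')
runs = horizontal_ws.findall("a \t b\n c")

# An optional group around an anchor records WHETHER the anchor matched:
# the group is '' when "^" matched and None when it was skipped.
probe = re.compile(r'(^)?x', re.MULTILINE)
groups = [m.group(1) for m in probe.finditer("x ax\nx")]

print(runs)
print(groups)
```

This is why comment_replacer tests start and end with `is not None` rather than for truthiness: a matched anchor yields an empty string, which is falsy.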

Examples:

>>> s = ("qwe /* 123\n"
...      "456\n"
...      "789 */ asd /* 123 */ zxc\n"
...      "rty // fgh\n")
>>> print('"' + '"\n"'.join(
...     remove_comments(s).splitlines()
... ) + '"')
"qwe"
"asd zxc"
"rty"
>>> comments_test = ("hello // comment\n"
...                  "line 2 /* a comment */\n"
...                  "line 3 /* a comment*/ /*comment*/\n"
...                  "line 4 /* a comment\n"
...                  "continuation of a comment*/ line 5\n"
...                  "/* comment */line 6\n"
...                  "line 7 /*********\n"
...                  "********************\n"
...                  "**************/\n"
...                  "line ?? /*********\n"
...                  "********************\n"
...                  "********************\n"
...                  "********************\n"
...                  "********************\n"
...                  "**************/\n")
>>> print('"' + '"\n"'.join(
...     remove_comments(comments_test).splitlines()
... ) + '"')
"hello"
"line 2"
"line 3 "
"line 4"
"line 5"
"line 6"
"line 7"
"line ??"

Edits:

Updated to new specification.
Added another example.

MizardX 2009-05-10 05:01:02

I used this one, and since it worked fine, I did not try the rest. Apologies to everyone else who answered correctly.

Roee Adler 2009-05-11 05:06:30

@MizardX: I would appreciate it if you looked at my edit (to the question) and clarifications, thanks.

Roee Adler 2009-05-11 05:59:48

Answer 5

+3 A:

The fact that you have to even ask this question, and that the solutions given are, shall we say, less than perfectly readable :-) should be a good indication that REs are not the real answer to this question.

You would be far better, from a readability viewpoint, to actually code this up as a relatively simple parser.

Too often, people try to use REs to be "clever" (I don't mean that in a disparaging way), thinking that a single line is elegant, but all they end up with is an unmaintainable morass of characters. I'd rather have a fully commented 20-line solution that I can understand in an instant.

paxdiablo 2009-05-10 05:07:18
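
For comparison, a hand-written scanner along the lines this answer suggests might look like the sketch below. It is not code from the thread: the function name `strip_comments` is hypothetical, comments are assumed not to nest, string literals containing comment markers are not handled, and surrounding whitespace is left untouched.

```python
def strip_comments(text):
    """Remove /* ... */ and // comments with a simple left-to-right scan."""
    out = []
    i, n = 0, len(text)
    while i < n:
        if text.startswith('/*', i):
            end = text.find('*/', i + 2)
            if end == -1:
                # Unterminated comment: drop the rest of the input.
                break
            # Keep one line break if the comment spanned several lines.
            if '\n' in text[i + 2:end]:
                out.append('\n')
            i = end + 2
        elif text.startswith('//', i):
            end = text.find('\n', i)
            if end == -1:
                # Comment runs to the end of the input.
                break
            i = end  # leave the newline itself in place
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)

cleaned = strip_comments("qwe /* 123\n456\n789 */ asd\nrty // fgh\n")
print(repr(cleaned))
```

Each branch handles exactly one case, which is the readability argument being made here; trimming the leftover spaces would need an extra step.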

@Pax: The reason I aimed at a regular expression is that I thought it would be more efficient. I have millions of lines of code to analyze, and I'm trying to eliminate performance bottlenecks. Currently I have "readable" code doing the work, and I thought I could beef up performance by moving to regex. Do you disagree with this logic? Thanks.

Roee Adler 2009-05-10 05:40:21

REs are never more efficient than a well-written parser *in compiled languages*. That's because you can use domain knowledge when you write the parser (more speed), but an RE engine has to be able to handle everything. In the case of Python (unless it has JIT), an RE will probably be faster, since the RE engine will be machine language whereas an interpreted parser will be, well, interpreted. I still prefer readability over speed, though. Compute time (running code) is a lot cheaper than person time (maintaining code). So no, I don't disagree, but you need to be aware of what you're sacrificing.

paxdiablo 2009-05-10 05:49:41
