views: 396

answers: 7

So I'm new to Python (I wrote my first bit of code in it today). I'm relatively experienced in C, and decent in C++, Java and assembler, if a bit out of practice. I'm making a program to automate the writing of some C code (I'm writing it to parse strings into enumerations with the same name); C's handling of strings is, well, not that great. So some people have been nagging me to try Python (actually, they have been nagging me to make Python my primary language).

I'm finding Python kind of weird; it's very loose.

Anyway, I made a function that is supposed to remove C-style /* COMMENT */ and //COMMENT comments from a string. Here's the code:

def removeComments(string):
    re.sub(re.compile("/\*.*?\*/", re.DOTALL), "", string)  # remove all occurrences of streamed comments (/* COMMENT */) from string
    re.sub(re.compile("//.*?\n"), "", string)  # remove all occurrences of single-line comments (// COMMENT\n) from string

It's been a while since I used regex too, and even then I'd only used it less than a dozen times.

So I tried this code out.

str="/* spam * spam */ eggs"
removeComments(str)
print str

And it apparently did nothing.

Any suggestions as to what I've done wrong?

There's a saying I've heard a couple of times: if you have a problem and you try to solve it with regex, you end up with two problems.

+1  A: 

You are doing it wrong.

Regex is for Regular Languages, which C isn't.

Otto Allmendinger
Of course, one of the commonly expected differences between a lexer and a parser is that a lexer only supports a regular language. That's not always true (e.g. see Ragel), just as it isn't always true of regex implementations. A good lexer can do the job, but, as with using a parser, it seems like massive overkill just for comment stripping.
Steve314
@Steve314, If by "overkill" you mean *totally the right tool for the job*, then yeah. All of the regexps posted here are extremely buggy and will not do the right thing when faced with valid, realistic C(++) code.
Mike Graham
Read up about the lexer, removed my recommendation of a lexer
Otto Allmendinger
@Mike - On second thoughts, I agree - but the specific reason hasn't been mentioned (though it's a special case of your "valid, realistic" point). I just thought about things that look like comment markers, but are actually just characters in string literals. Avoiding those without the right tools would be a nasty job. Grab an existing C lexer (as long as it preserves the whitespace) - not so bad.
Steve314
@Mike - my own answer deleted as a result - considered harmful.
Steve314
@Steve314, That is the obvious valid, realistic code. (Like the example I posted in reply to msanders earlier.)
Mike Graham
+4  A: 

I would suggest using a REAL parser like SimpleParse or PyParsing. SimpleParse requires that you actually know EBNF, but is very fast. PyParsing has its own EBNF-like syntax, adapted for Python, that makes it a breeze to build powerfully accurate parsers.

Edit:

Here is an example of how easy it is to use PyParsing in this context:

>>> test = '/* spam * spam */ eggs'
>>> import pyparsing
>>> comment = pyparsing.nestedExpr("/*", "*/").suppress()
>>> print comment.transformString(test)         
' eggs'

Here is a more complex example using single and multi-line comments.

Before:

/*
 * multiline comments
 * abc 2323jklj
 * this is the worst C code ever!!
*/
void
do_stuff ( int shoe, short foot ) {
    /* this is a comment
     * multiline again! 
     */
    exciting_function(whee);
} /* extraneous comment */

After:

>>> print comment.transformString(code)   

void
do_stuff ( int shoe, short foot ) {

     exciting_function(whee);
} 

It leaves an extra newline wherever it stripped comments, but that could be addressed.
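
pyparsing also ships predefined comment expressions; if I remember right, `cppStyleComment` covers both the `/* */` and `//` forms, so a sketch along these lines should strip both styles in one pass (the exact spacing of the output may differ slightly):

>>> import pyparsing
>>> code = 'int x = 0; /* block */ int y = 1; // trailing'
>>> print pyparsing.cppStyleComment.suppress().transformString(code)
int x = 0;  int y = 1;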

jathanism
Regex is bad, but parsing is overkill? I am confused; what else is there?
jathanism
I was looking at the problem wrong - searching based on a simple alternation regex is much easier than writing a parser. That said, it doesn't address the confusion caused by things inside strings. A parser (or lexer), as Mike commented, may be exactly the right tool for the job.
Steve314
Yeah, regex is "easy" if your input is easy, such as things with a consistent format like IP addresses or phone numbers. For everything else: a lexer.
jathanism
I don't think it leaves an extra newline - the newline just isn't part of the comment, so it's not stripped, and it's not necessarily safe to strip it anyway, as a newline *can* be significant whitespace in C
gnibbler
Ah, that is a good observation and now that I look at it again I agree. :)
jathanism
+1  A: 

re.sub returns a string, so changing your code to the following will give results:

def removeComments(string):
    string = re.sub(re.compile("/\*.*?\*/", re.DOTALL), "", string)  # remove all occurrences of streamed comments (/* COMMENT */) from string
    string = re.sub(re.compile("//.*?\n"), "", string)  # remove all occurrences of single-line comments (// COMMENT\n) from string
    return string
msanders
`char *note = "You can make a one-line comment with //";` Oops.
Mike Graham
Indeed. This only answers why the OP's function returned nothing.
msanders
This is the technically correct answer to the question. Maybe using a parser is a better way to solve my problem, but this made my code work.
Oxinabox
A: 

I would recommend you read this page, which has a quite detailed analysis of the problem and gives a good understanding of why your approach doesn't work: http://ostermiller.org/findcomment.html

Short version: The regex you are looking for is this:

(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)

This should match both types of comment. If you are having trouble following it, read the page I linked.
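
To try it out from Python, something along these lines should work (a rough sketch; `remove_comments` is just a placeholder name, and the pattern is passed as a raw string so the backslashes survive):

import re

comment_re = re.compile(r'(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)')

def remove_comments(text):
    # drop everything the pattern matches; surrounding whitespace stays behind
    return comment_re.sub('', text)

print remove_comments('/* spam * spam */ eggs  // more spam')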

MatsT
Unless I've missed something, this would miss a comment delimiter spliced across lines (slash, backslash, new-line, asterisk or asterisk, backslash, newline, slash). Worse, that backslash can be generated as the trigraph sequence `??/` (though I'll admit trigraphs are pretty rare).
Jerry Coffin
+1  A: 

I see several things you might want to revise.

First, in Python, arguments are passed as references to objects, and some object types are immutable. Strings and integers are among these immutable types. So if you pass a string to a function, nothing the function does to it can change the string you passed in; you should have the function return a new string instead. Furthermore, within the removeComments() function, you need to assign the value returned by re.sub() to a variable -- like any function that takes a string as an argument, re.sub() will not modify the string.
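
A minimal illustration of that point (the variable names here are just for the example):

import re

s = "/* spam */ eggs"
re.sub(r"/\*.*?\*/", "", s)       # a new string is computed, then thrown away
print s                           # still prints the original, comment and all

s = re.sub(r"/\*.*?\*/", "", s)   # bind the returned string to keep the result
print s                           # now prints " eggs"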

Second, I would echo what others have said about parsing C code. Regular expressions are not the best way to go here.

jhoon
A: 

As noted in one of my other comments, comment nesting isn't really the problem (in C, comments don't nest, though a few compilers do support nested comments anyway). The problem is with things like string literals, which can contain the exact same character sequence as a comment delimiter without actually being one.

As Mike Graham said, the right tool for the job is a lexer. A parser is unnecessary and would be overkill, but a lexer is exactly the right thing. As it happens, I posted a (partial) lexer for C (and C++) earlier this morning. It doesn't attempt to correctly identify all lexical elements (i.e. all keywords and operators) but it's entirely sufficient for stripping comments. It won't do any good on the "using Python" front though, as it's written entirely in C (it predates my using C++ for much more than experimental code).
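
To give a feel for what the string-literal problem does to a regex-only approach, here is a rough sketch (an illustration only, not the lexer mentioned above) that matches string and character literals alongside comments and keeps the literals; it still ignores the line-spliced delimiters and trigraphs from my comment above:

import re

# match string/char literals OR comments in one alternation; keeping the
# literals means a "/*" inside a string is left alone
pattern = re.compile(
    r'("(\\.|[^"\\])*")'        # double-quoted string literal
    r"|('(\\.|[^'\\])*')"       # character literal
    r'|(/\*.*?\*/)'             # block comment
    r'|(//[^\n]*)',             # line comment
    re.DOTALL)

def strip_comments(code):
    def repl(m):
        # group 1 = string literal, group 3 = char literal: keep those, drop comments
        return m.group(0) if (m.group(1) or m.group(3)) else ''
    return pattern.sub(repl, code)

print strip_comments('char *note = "not a // comment"; /* this one goes */')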

Jerry Coffin
A: 
mystring="""
blah1 /* comments with
multiline */

blah2
blah3
// double slashes comments
blah4 // some junk comments

"""
for s in mystring.split("*/"):
    s=s[:s.find("/*")]
    print s[:s.find("//")]

Output:

$ ./python.py

blah1


blah2
blah3
ghostdog74