views: 396

answers: 7

So I'm new to Python (I wrote my first bit of code in it today). I'm relatively experienced in C, and decent in C++, Java and assembler, if a bit out of practice. I'm making a program to automate the writing of some C code (I'm writing it to parse strings into enumerations with the same name); C's handling of strings is, well, not that great. So some people have been nagging me to try Python (actually, they have been nagging me to make Python my primary language).

I'm finding Python kind of weird; it's very loose.

Anyway, I made a function that is supposed to remove C-style /* COMMENT */ and //COMMENT comments from a string. Here's the code:

def removeComments(string):
    re.sub(re.compile("/\*.*?\*/", re.DOTALL), "", string)  # remove all occurrences of streamed comments (/* COMMENT */) from string
    re.sub(re.compile("//.*?\n"), "", string)  # remove all occurrences of single-line comments (// COMMENT\n) from string

It's been a while since I used regex too, and even then I'd only used it less than a dozen times.

So I tried this code out.

str="/* spam * spam */ eggs"
removeComments(str)
print str

And it apparently did nothing.

Any suggestions as to what I've done wrong?

There's a saying I've heard a couple of times: if you have a problem and you try to solve it with regex, you end up with two problems.

+1  A: 

You are doing it wrong.

Regex is for Regular Languages, which C isn't.

Otto Allmendinger
Of course, one of the commonly expected differences between a lexer and a parser is that a lexer only supports a regular language. That's not always true (e.g. see Ragel), just as it isn't always true of regex implementations. A good lexer can do the job, but, as with using a parser, it seems like massive overkill just for comment stripping.
Steve314
@Steve314, If by "overkill" you mean *totally the right tool for the job*, then yeah. All of the regexps posted here are extremely buggy and will not do the right thing when faced with valid, realistic C(++) code.
Mike Graham
Read up about the lexer, removed my recommendation of a lexer
Otto Allmendinger
@Mike - On second thoughts, I agree - but the specific reason hasn't been mentioned (though it's a special case of your "valid, realistic" point). I just thought about things that look like comment markers, but are actually just characters in string literals. Avoiding those without the right tools would be a nasty job. Grab an existing C lexer (as long as it preserves the whitespace) - not so bad.
Steve314
@Mike - my own answer deleted as a result - considered harmful.
Steve314
@Steve314, That is the obvious valid, realistic code. (Like the example I posted in reply to msanders earlier.)
Mike Graham
+4  A: 

I would suggest using a REAL parser like SimpleParse or PyParsing. SimpleParse requires that you actually know EBNF, but is very fast. PyParsing has its own EBNF-like syntax, adapted for Python, that makes it a breeze to build powerfully accurate parsers.

Edit:

Here is an example of how easy it is to use PyParsing in this context:

>>> test = '/* spam * spam */ eggs'
>>> import pyparsing
>>> comment = pyparsing.nestedExpr("/*", "*/").suppress()
>>> print comment.transformString(test)         
' eggs'

Here is a more complex example using single and multi-line comments.

Before:

/*
 * multiline comments
 * abc 2323jklj
 * this is the worst C code ever!!
*/
void
do_stuff ( int shoe, short foot ) {
    /* this is a comment
     * multiline again! 
     */
    exciting_function(whee);
} /* extraneous comment */

After:

>>> print comment.transformString(code)   

void
do_stuff ( int shoe, short foot ) {

     exciting_function(whee);
} 

It leaves an extra newline wherever it stripped comments, but that could be addressed.
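
pyparsing also ships predefined comment expressions; if I remember right, `cppStyleComment` covers both the `/* */` and `//` forms, so a sketch along these lines should strip both styles in one pass (the exact spacing of the output may differ slightly):

>>> import pyparsing
>>> code = 'int x = 0; /* block */ int y = 1; // trailing'
>>> print pyparsing.cppStyleComment.suppress().transformString(code)
int x = 0;  int y = 1;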

jathanism
Regex is bad, but parsing is overkill? I am confused; what else is there?
jathanism
I was looking at the problem wrong - searching based on a simple alternation regex is much easier than writing a parser. That said, it doesn't address the confusion caused by things inside strings. A parser (or lexer), as Mike commented, may be exactly the right tool for the job.
Steve314
Yeah, regex is "easy" if your input is easy, such as things with a consistent format like IP addresses or phone numbers. For everything else: a lexer.
jathanism
I don't think it leaves an extra newline - the newline just isn't part of the comment, so it's not stripped, and it's not necessarily safe to strip it anyway, as a newline *can* be significant whitespace in C
gnibbler
Ah, that is a good observation and now that I look at it again I agree. :)
jathanism
+1  A: 

re.sub returns a string, so changing your code to the following will give results:

def removeComments(string):
    string = re.sub(re.compile("/\*.*?\*/", re.DOTALL), "", string)  # remove all occurrences of streamed comments (/* COMMENT */) from string
    string = re.sub(re.compile("//.*?\n"), "", string)  # remove all occurrences of single-line comments (// COMMENT\n) from string
    return string
msanders
`char *note = "You can make a one-line comment with //";` Oops.
Mike Graham
Indeed. This only answers why the OP's function returned nothing.
msanders
This is the technically correct answer to the question. Maybe using a parser is a better way to solve my problem, but this made my code work.
Oxinabox
A: 

I would recommend you read this page, which has a quite detailed analysis of the problem and gives a good understanding of why your approach doesn't work: http://ostermiller.org/findcomment.html

Short version: The regex you are looking for is this:

(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)

This should match both types of comment. If you are having trouble following it, read the page I linked.
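
To try it out from Python, something along these lines should work (a rough sketch; `remove_comments` is just a placeholder name, and the pattern is passed as a raw string so the backslashes survive):

import re

comment_re = re.compile(r'(/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/)|(//.*)')

def remove_comments(text):
    # drop everything the pattern matches; surrounding whitespace stays behind
    return comment_re.sub('', text)

print remove_comments('/* spam * spam */ eggs  // more spam')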

MatsT
Unless I've missed something, this would miss a comment delimiter spliced across lines (slash, backslash, new-line, asterisk or asterisk, backslash, newline, slash). Worse, that backslash can be generated as the trigraph sequence `??/` (though I'll admit trigraphs are pretty rare).
Jerry Coffin
+1  A: 

I see several things you might want to revise.

First, in Python, arguments are passed as references to objects, and some object types are immutable. Strings and integers are among these immutable types. So if you pass a string to a function, nothing the function does to it can change the string you passed in; you should have the function return a new string instead. Furthermore, within the removeComments() function, you need to assign the value returned by re.sub() to a variable -- like any function that takes a string as an argument, re.sub() will not modify the string.
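
A minimal illustration of that point (the variable names here are just for the example):

import re

s = "/* spam */ eggs"
re.sub(r"/\*.*?\*/", "", s)       # a new string is computed, then thrown away
print s                           # still prints the original, comment and all

s = re.sub(r"/\*.*?\*/", "", s)   # bind the returned string to keep the result
print s                           # now prints " eggs"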

Second, I would echo what others have said about parsing C code. Regular expressions are not the best way to go here.

jhoon
A: 

As noted in one of my other comments, comment nesting isn't really the problem (in C, comments don't nest, though a few compilers do support nested comments anyway). The problem is with things like string literals, which can contain the exact same character sequence as a comment delimiter without actually being one.

As Mike Graham said, the right tool for the job is a lexer. A parser is unnecessary and would be overkill, but a lexer is exactly the right thing. As it happens, I posted a (partial) lexer for C (and C++) earlier this morning. It doesn't attempt to correctly identify all lexical elements (i.e. all keywords and operators) but it's entirely sufficient for stripping comments. It won't do any good on the "using Python" front though, as it's written entirely in C (it predates my using C++ for much more than experimental code).
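
To give a feel for what the string-literal problem does to a regex-only approach, here is a rough sketch (an illustration only, not the lexer mentioned above) that matches string and character literals alongside comments and keeps the literals; it still ignores the line-spliced delimiters and trigraphs from my comment above:

import re

# match string/char literals OR comments in one alternation; keeping the
# literals means a "/*" inside a string is left alone
pattern = re.compile(
    r'("(\\.|[^"\\])*")'        # double-quoted string literal
    r"|('(\\.|[^'\\])*')"       # character literal
    r'|(/\*.*?\*/)'             # block comment
    r'|(//[^\n]*)',             # line comment
    re.DOTALL)

def strip_comments(code):
    def repl(m):
        # group 1 = string literal, group 3 = char literal: keep those, drop comments
        return m.group(0) if (m.group(1) or m.group(3)) else ''
    return pattern.sub(repl, code)

print strip_comments('char *note = "not a // comment"; /* this one goes */')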

Jerry Coffin
A: 
mystring="""
blah1 /* comments with
multiline */

blah2
blah3
// double slashes comments
blah4 // some junk comments

"""
for s in mystring.split("*/"):
    s=s[:s.find("/*")]
    print s[:s.find("//")]

Output:

$ ./python.py

blah1


blah2
blah3
ghostdog74