I'd like to filter out (mostly one-line) comments from (mostly valid) JavaScript using Python's re module. For example:

// this is a comment
var x = 2 // and this is a comment too
var url = "http://www.google.com/" // and "this" too
url += 'but // this is not a comment' // however this one is
url += 'this "is not a comment' + " and ' neither is this " // only this

I've been trying this for more than half an hour without any success. Can anyone please help me?

EDIT 1:

foo = 'http://stackoverflow.com/' // these // are // comments // too //

EDIT 2:

bar = 'http://no.comments.com/'
+1  A: 

It might be easier to parse if you had explicit semi-colons.

In any case, this works:

import re

rx = re.compile(r'.*(//(.*))$')

lines = ["// this is a comment", 
    "var x = 2 // and this is a comment too",
    """var url = "http://www.google.com/" // and "this" too""",
    """url += 'but // this is not a comment' // however this one is""",
    """url += 'this "is not a comment' + " and ' neither is this " // only this""",]

for line in lines: 
    print(rx.match(line).groups())

Output of the above:

('// this is a comment', ' this is a comment')
('// and this is a comment too', ' and this is a comment too')
('// and "this" too', ' and "this" too')
('// however this one is', ' however this one is')
('// only this', ' only this')
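
A caveat worth checking: the pattern is greedy and assumes every line ends with a comment. The sketch below runs it on the two lines from the question's edits plus one line with no `//` at all. The helper name `find_comment` and the input `var y = 3` are only for illustration and are not part of the original code.

```python
import re

rx = re.compile(r'.*(//(.*))$')

def find_comment(line):
    """Return the groups of the greedy pattern, or None when nothing matches."""
    m = rx.match(line)
    return m.groups() if m else None

# The last '//' wins, so a comment that itself contains '//' is cut short.
print(find_comment("foo = 'http://stackoverflow.com/' // these // are // comments // too //"))

# A URL inside a string literal is mistaken for a comment.
print(find_comment("bar = 'http://no.comments.com/'"))

# No '//' at all: match() returns None, so calling .groups() on it would raise.
print(find_comment("var y = 3"))
```

So the pattern is fine for lines shaped like the original five examples, but not for arbitrary input.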

I'm not sure what you're doing with the JavaScript after removing the comments, but JSMin might help. It removes comments well enough anyway, and there is an implementation in Python.

Seth
Thanks, this is definitely a +1. Let me modify my question a bit now :)
Attila Oláh
Also, the JavaScript is not written by me, so unfortunately I can't guarantee the explicit semicolons...
Attila Oláh
Ehm, no, this will only work if there is always a comment at the end of the line, and when there's no // inside the comment itself. Both `var url = "http://www"` and `// comments are started with //` will fail.
Thomas Wouters
@Thomas Well, it works for the specified inputs. As @Anon mentioned, a real parser is needed here to catch everything properly.
Seth
Thanks, the Python implementation of JSMin will fit my needs for now.
Attila Oláh
+2  A: 

My regex powers had gone a bit stale, so I've used your question to refresh what I remember. It became a fairly large regex, mostly because I also wanted to filter multi-line comments.

import re

reexpr = r"""
    (                           # Capture code
        "(?:\\.|[^"\\])*"       # String literal
        |
        '(?:\\.|[^'\\])*'       # String literal
        |
        (?:[^/\n"']|/[^/*\n"'])+ # Any code besides newlines or string literals
        |
        \n                      # Newline
    )|
    (/\*  (?:[^*]|\*[^/])*   \*/)        # Multi-line comment
    |
    (?://(.*)$)                 # Comment
    """
rx = re.compile(reexpr, re.VERBOSE | re.MULTILINE)

This regex has three capture groups: one for code and two for comment contents (multi-line and single-line). Below is an example of how to extract only the code from a sample.

code = r"""// this is a comment
var x = 2 * 4 // and this is a comment too
var url = "http://www.google.com/" // and "this" too
url += 'but // this is not a comment' // however this one is
url += 'this "is not a comment' + " and ' neither is this " // only this

bar = 'http://no.comments.com/' // these // are // comments
bar = 'text // string \' no // more //\\' // comments
bar = 'http://no.comments.com/'
bar = /var/ // comment

/* comment 1 */
bar = open() /* comment 2 */
bar = open() /* comment 2b */// another comment
bar = open( /* comment 3 */ file) // another comment 
"""

parts = rx.findall(code)
print(''.join(x[0] for x in parts))
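
The same idea can be wrapped in a reusable function. This is only a sketch: it uses the same alternatives as the regex above, and the names `JS_TOKEN` and `strip_js_comments` are made up for illustration. It inherits the same limits, for example input that the pattern cannot match at all is silently dropped rather than reported.

```python
import re

# Group 1 is code to keep, group 2 is a multi-line comment,
# group 3 is the text after the two slashes of a single-line comment.
JS_TOKEN = re.compile(r"""
    (                               # 1: code to keep
        "(?:\\.|[^"\\])*"           # double-quoted string literal
        |
        '(?:\\.|[^'\\])*'           # single-quoted string literal
        |
        (?:[^/\n"']|/[^/*\n"'])+    # code without newlines, quotes or comment starts
        |
        \n                          # newline
    )
    |
    (/\*(?:[^*]|\*[^/])*\*/)        # 2: multi-line comment
    |
    //(.*)$                         # 3: single-line comment body
    """, re.VERBOSE | re.MULTILINE)

def strip_js_comments(source):
    """Return source with comments removed, keeping code and newlines."""
    # group(1) is None for comment matches, so those contribute nothing.
    return "".join(m.group(1) or "" for m in JS_TOKEN.finditer(source))

sample = "var x = 2 // c\nurl = \"http://a.b/\" /* d */ + 'x // y'\n"
print(strip_js_comments(sample))
```

Whitespace that preceded a removed comment is kept, so lines may end with trailing spaces.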
driax
Wow, this is even one step ahead of the question, but this is exactly what I need! Thank you very much for taking your time to solve this problem!
Attila Oláh
I edited the regex because it didn't match '*', as in 'x = 4 * 5', which became 'x = 4 5'.
driax
Doesn’t work for `/* / */` or `/* // */`. Fix: Replace `/\\* (?:\\*?[^/]|\n)* \\*/` with `/\\* (?:[^*]|\\*[^/])* \\*/`.
Gumbo
Thanks Gumbo, I've changed the regex.
driax