I am looking for a command-line tool that removes all comments from an input file and returns the stripped output. It would be nice if it supported popular languages such as C, C++, Python, PHP, JavaScript, HTML, and CSS. It has to be syntax-aware rather than regexp-based, since a regexp will also match comment-like patterns inside string literals. Is there any such tool?

I am fully aware that comments are useful information and often leaving them as they are is a good idea. It's just that my focus is on different use cases.

A: 

I don't know of such a tool - which isn't the same as saying there isn't one.

I once started to design one, but it quickly gets insane - not helped by the comment rules in C and C++.

/\
*  Comment? *\
/

(Answer: yes!)

"/\
* Comment? *\
/"

(Answer: no!)

To do the job reasonably, you have to be aware of:

  • Language comment conventions
  • Language quoted string conventions (Python and Perl are enough to drive you insane here)
  • Escape conventions (Shell gets you here - along with the quotes)

These combine to make the job tolerably close to impossible.
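To make the three points concrete, here is a minimal sketch of a scanner for C-style comments only. It is not how scc works; the function name strip_c_comments is made up for illustration, and the sketch assumes no trigraphs, no C++ raw string literals, and no digit separators such as 1'000. It joins backslash-newline splices first, then copies string and character literals verbatim so that comment markers inside them are left alone.

```python
def strip_c_comments(source):
    """Remove /* */ and // comments from C-like source text (sketch only)."""
    # Join lines ending in a backslash first, as the C translation phases do.
    # This is what turns "/\<newline>*" into a real comment opener.
    text = source.replace("\\\n", "")
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in "\"'":
            # Copy a string or character literal verbatim, honouring escapes.
            j = i + 1
            while j < n and text[j] != ch and text[j] != "\n":
                j += 2 if text[j] == "\\" else 1
            j = min(j + 1, n)  # include the closing quote (or the newline)
            out.append(text[i:j])
            i = j
        elif text.startswith("/*", i):
            # A block comment is replaced by a single space.
            end = text.find("*/", i + 2)
            out.append(" ")
            i = n if end == -1 else end + 2
        elif text.startswith("//", i):
            # A line comment runs up to, but not including, the newline.
            end = text.find("\n", i)
            i = n if end == -1 else end
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Fed the two examples above, this reduces the first to a single space and returns the second as a one-line string literal with its contents intact. Note that the splices themselves are removed from the output, which a production tool would probably want to preserve.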

I ended up with a program, scc, to strip C and C++ comments. Its torture test includes worse examples than the comments shown above - and it does a decent job. But extending that to do shell or Perl or Python or (take your pick) was sufficiently non-trivial that I did not do it.
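For Python specifically, some of the hard part can be delegated, because the standard library's tokenize module already implements the language's string and comment rules. A minimal sketch under that assumption follows; the function name strip_python_comments is made up for illustration, and it has nothing to do with scc. It removes every '#' comment, including shebang and encoding lines, and leaves docstrings alone because those are strings, not comments.

```python
import io
import tokenize


def strip_python_comments(source):
    """Return Python source with '#' comments removed (docstrings are kept)."""
    lines = source.splitlines(keepends=True)
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    for tok in tokens:
        if tok.type != tokenize.COMMENT:
            continue
        # A comment always runs to the end of its line, so truncating the
        # line at the comment's start column removes exactly the comment.
        row, col = tok.start
        line = lines[row - 1]
        ending = line[len(line.rstrip("\r\n")):]  # preserve the line ending
        lines[row - 1] = line[:col].rstrip() + ending
    return "".join(lines)
```

A line that held only a comment becomes a blank line rather than disappearing, so line numbers in the stripped file still match the original.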

Jonathan Leffler
yup. i know. i could program such a tool myself with javacc and available syntax .jj files (there are tons of them available, for all popular languages). i was just wondering if anyone had ever tried it. (none of my questions ever get fully satisfactory answers, btw..)
OTZ
+2  A: 

cloc, a free Perl script, can do this.

Remove Comments from Source Code

How can you tell if cloc correctly identifies comments? One way to convince yourself cloc is doing the right thing is to use its --strip-comments option to remove comments and blank lines from files, then compare the stripped-down files to originals.

It supports a lot of languages.

Mark Rushakoff
I'm sorry, the tool misses a lot of comments -- even the most rudimentary comment instances -- on a test python file. Isn't it regexp-based? I do not know if you ever tried it, but it seems unusable for this purpose.
OTZ
@otz: It's a mature tool that hasn't failed me on any of my uses, including Python scripts. I do not think that you are using it correctly. For example, the command `perl /path/to/cloc-1.51.pl --strip-comments=n .` executed in a directory with a file `foo.py` will create a `foo.py.n` file with comments and blank lines removed. I would like to see an example of what you claim isn't working (and I'm sure cloc's developer would, too).
Mark Rushakoff
A: 

You might coax GNU Source-highlight into doing this.

lhf
correct me if i'm wrong, but it isn't fully syntax-aware, it seems to me. it is semi-syntax-aware as it is essentially regexp-based, but regexps are "structured" depending on languages.
OTZ
+2  A: 

What you want can be done with emacs scripting.

I wrote this script for you which does exactly what you want and can be easily extended to any language.

Filename: kill-comments

#!/usr/bin/env python
"""Strip comments from a file in place using Emacs in batch mode."""

import os
import subprocess
import sys

target_file = sys.argv[1]

# To load a different init file (for more syntax support),
# change the path passed to -l below.
command = [
    "emacs", "-batch",
    "-l", os.path.expanduser("~/.emacs-batch"),
    target_file,
    # kill-comment takes a line count; pass the length of the whole buffer.
    "--eval", "(kill-comment (count-lines (point-min) (point-max)))",
    "-f", "save-buffer",
]

# Discard everything Emacs prints on stdout and stderr.
with open(os.devnull, "w") as fnull:
    subprocess.call(command, stdout=fnull, stderr=fnull)

To use it, just call:

kill-comments <file-name>

To add a language, edit ~/.emacs-batch and load that language's major mode. You can find syntax-aware modes for basically everything you could want at http://www.emacswiki.org.

As an example, here is my ~/.emacs-batch file. It extends the above script to remove comments from javascript files. (I have javascript.el in my ~/.el directory)

(setq load-path (append (list (concat (getenv "HOME") "/.el")) load-path))    
(load "javascript")                                               
(setq auto-mode-alist (cons '("\\.js$" . javascript-mode) auto-mode-alist))

With the JavaScript addition, this will remove comments from all the file types you mentioned, as well as many more.

Good Luck and happy coding!

Robert McIntyre
lol. though it isn't robust enough, as emacs highlighting for most languages is regexp-based (and therefore would catch certain weirdly formulated comments in strings), i am impressed by your idea :) What I want, though, is a syntax-aware tool. Therefore, it'd have to have language parsers built in (JavaCC .jj .jjt files anyone?)
OTZ
You should try it on some test files before dismissing it. Emacs syntax highlighting is quite good.
Robert McIntyre
tried the (kill-comment ARG) on emacs. am satisfied by its accuracy in general. it's certainly a million times more accurate than that cloc tool suggested on this page. so thanks for that. but again, kill-comment function depends on the syntax highlighting in emacs, most of which are regexp-based. so it really depends on how good the <language>-mode.el is. but again, i like your idea.
OTZ
"Ideally" you would want an embedded compiler/interpreter for the language, but consider that actually determining what is a comment and what isn't could be uncomputable for certain pathological languages (and thus take forever), in which case you'd actually want to go for the regex-based approach to remove "standard" comments! For many languages a full parse is not required to remove comments, and the regex approach can be proven correct (this is what the pre-processor might do anyway). Remember to vote up answers you find useful and pick one as "the" answer, as per the etiquette here.
Robert McIntyre
+1  A: 

Paul Dixon's response to this question on stripping comments from a script might be worth looking at.

Mark Baker
A: 

No such tool exists yet.

OTZ