ansaurus

Question

How to match a comment unless it's in a quoted string?

Answer 1

+2 A:

Use a parser and determine it character by character.

Kickoff example:

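// 'string' is assumed to hold the text to scan.
StringBuilder builder = new StringBuilder();
boolean quoted = false; // true while inside a double-quoted string

for (String line : string.split("\\n")) {
    for (int i = 0; i < line.length(); i++) {
        char c = line.charAt(i);
        if (c == '"') {
            quoted = !quoted; // naive: does not account for escaped quotes (\")
        }
        if (!quoted && c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
            break; // start of a comment outside a string: drop the rest of the line
        } else {
            builder.append(c);
        }
    }
    builder.append("\n");
}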

String parsed = builder.toString();
System.out.println(parsed);

BalusC 2010-02-17 21:48:42

@BalusC, this may lead the OP to think the problem is easier than it is, though... @Confused, think about what you should do when you encounter a `\`. If you encounter a `\` followed by a `"`, should you still flip the `quoted` flag? Also consider what happens when `//` (or quotes) appear inside multi-line comment blocks.

Bart Kiers 2010-02-17 22:10:52

@Bart K.: it was just a kickoff example to see the light :)

BalusC 2010-02-17 22:32:45

@BalusC, yes, I know. Just felt the need to warn the OP... :)

Bart Kiers 2010-02-17 22:39:29
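
The points raised in the comments above (escaped quotes, char literals, multi-line comments) can be sketched as a small state machine. This is only an illustration, not the code of either poster: the class name `LineCommentStripper`, the `State` enum and the `strip` method are made up for this example, and it assumes Java-like syntax without unicode escapes.

```java
final class LineCommentStripper {

    // Hypothetical states for this sketch; not taken from the original answer.
    private enum State { CODE, STRING, CHAR, BLOCK_COMMENT, LINE_COMMENT }

    /** Removes // comments; string/char literals and block comments are copied unchanged. */
    static String strip(String src) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            char next = i + 1 < src.length() ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') {      // comment starts: skip both slashes
                        state = State.LINE_COMMENT;
                        i++;
                        continue;
                    }
                    if (c == '/' && next == '*') {      // block comment starts: copy the opener
                        state = State.BLOCK_COMMENT;
                        out.append("/*");
                        i++;
                        continue;
                    }
                    if (c == '"') state = State.STRING;
                    else if (c == '\'') state = State.CHAR;
                    out.append(c);
                    break;
                case STRING:
                case CHAR:
                    char close = (state == State.STRING) ? '"' : '\'';
                    out.append(c);
                    if (c == '\\' && next != '\0') {    // escaped character: copy it, never treat it as a closer
                        out.append(next);
                        i++;
                    } else if (c == close) {
                        state = State.CODE;
                    }
                    break;
                case BLOCK_COMMENT:
                    out.append(c);
                    if (c == '*' && next == '/') {      // block comment ends
                        out.append('/');
                        i++;
                        state = State.CODE;
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n' || c == '\r') {       // keep the line break, drop the comment text
                        out.append(c);
                        state = State.CODE;
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String src = "String s = \" ... \\\" // no comment \"; // a comment\n";
        System.out.print(strip(src));
    }
}
```

Tracking the state explicitly means a `"` after a `\` never flips the string state, and a `//` inside `/* ... */` is copied as ordinary comment text.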

Answer 2

A:

You can't tell with a regex whether you are inside a double-quoted string or not. In the end, a regex is just a state machine (sometimes extended a bit). I would use a parser like the one provided by BalusC or this one.

If you want to know why regexes are limited, read about formal grammars. The Wikipedia article is a good start.

Piotr Czapla 2010-02-17 22:04:36
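
To illustrate the concern in the answer above: a pattern that only looks for `//` has no notion of being inside a string. The following is a minimal sketch; the class name `NaiveRegexDemo`, the method `stripNaively` and the sample line are invented for this example.

```java
final class NaiveRegexDemo {

    /** Deletes everything from the first "//" to the end of the line, with no awareness of quotes. */
    static String stripNaively(String line) {
        return line.replaceAll("//.*", "");
    }

    public static void main(String[] args) {
        // The "//" inside the URL is wrongly treated as the start of a comment.
        String line = "String url = \"http://example.com\"; // home";
        System.out.println(stripNaively(line));
    }
}
```

The string literal is cut off at `http:` because the pattern matches the first `//` it finds, which is why the answers in this thread either track state by hand or match string literals alongside comments.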

Answer 3

+4 A:

Instead of using a parser that parses an entire Java source file, or writing something yourself that parses only those parts you're interested in, you could use some 3rd party tool like ANTLR.

ANTLR has the ability to define only those tokens you are interested in (and of course the tokens that can mess up your token-stream like multi-line comments and String- and char literals). So you only need to define a lexer (another word for tokenizer) that correctly handles those tokens.

This is called a grammar. In ANTLR, such a grammar could look like this:

lexer grammar FuzzyJavaLexer;

options{filter=true;}

SingleLineComment
  :  '//' ~( '\r' | '\n' )*
  ;

MultiLineComment
  :  '/*' .* '*/'
  ;

StringLiteral
  :  '"' ( '\\' . | ~( '"' | '\\' ) )* '"'
  ;

CharLiteral
  :  '\'' ( '\\' . | ~( '\'' | '\\' ) )* '\''
  ;

Save the above in a file called FuzzyJavaLexer.g. Now download ANTLR 3.2 here and save it in the same folder as your FuzzyJavaLexer.g file.

Execute the following command:

java -cp antlr-3.2.jar org.antlr.Tool FuzzyJavaLexer.g

which will create a FuzzyJavaLexer.java source class.

Of course you need to test the lexer, which you can do by creating a file called FuzzyJavaLexerTest.java and copying the code below in it:

import org.antlr.runtime.*;

public class FuzzyJavaLexerTest {
    public static void main(String[] args) throws Exception {
        String source = 
            "class Test {                                 \n"+
            "  String s = \" ... \\\" // no comment \";   \n"+
            "  /*                                         \n"+
            "   * also no comment: // foo                 \n"+
            "   */                                        \n"+
            "  char quote = '\"';                         \n"+
            "  // yes, a comment, finally!!!              \n"+
            "  int i = 0; // another comment              \n"+
            "}                                            \n";
        System.out.println("===== source =====");
        System.out.println(source);
        System.out.println("==================");
        ANTLRStringStream in = new ANTLRStringStream(source);
        FuzzyJavaLexer lexer = new FuzzyJavaLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        for(Object obj : tokens.getTokens()) {
            Token token = (Token)obj;
            if(token.getType() == FuzzyJavaLexer.SingleLineComment) {
                System.out.println("Found a SingleLineComment on line "+token.getLine()+
                        ", starting at column "+token.getCharPositionInLine()+
                        ", text: "+token.getText());
            }
        }
    }
}

Next, compile your FuzzyJavaLexer.java and FuzzyJavaLexerTest.java by doing:

javac -cp .:antlr-3.2.jar *.java

and finally execute the FuzzyJavaLexerTest.class file:

// *nix/MacOS
java -cp .:antlr-3.2.jar FuzzyJavaLexerTest

or:

// Windows
java -cp .;antlr-3.2.jar FuzzyJavaLexerTest

after which you'll see the following being printed to your console:

===== source =====
class Test {                                 
  String s = " ... \" // no comment ";   
  /*                                         
   * also no comment: // foo                 
   */                                        
  char quote = '"';                         
  // yes, a comment, finally!!!              
  int i = 0; // another comment              
}                                            

==================
Found a SingleLineComment on line 7, starting at column 2, text: // yes, a comment, finally!!!              
Found a SingleLineComment on line 8, starting at column 13, text: // another comment

Pretty easy, eh? :)

Bart Kiers 2010-02-17 22:59:42

ANTLR *is* a parser generator.

KennyTM 2010-02-17 23:05:30

@KennyTM, err, I know. But ANTLR can be used to create a lexer only (without a parser) and even a lexer that lexes only parts you're interested in (making writing a grammar far easier: you don't need to parse the entire source file). Sorry for asking, but did you read my reply at all?

Bart Kiers 2010-02-17 23:07:30

Nice little ANTLR tutorial there! This is the kind of thing I can never seem to find on those rare occasions when I need something like ANTLR.

Alan Moore 2010-02-18 09:49:32

Thanks Alan. Yes, using a lexer grammar with `options{filter=true;}`, which lets you specify only those tokens you're interested in, is not a very well-known feature of ANTLR. I've used it quite a bit for syntax highlighting in a little text editor I created. It makes adding a new language highlighter a breeze (when moderately familiar with ANTLR grammars, of course).

Bart Kiers 2010-02-18 09:59:55

+1 Great example! I just haven't taken the time to learn ANTLR yet and this will help. But I'm still pretty comfortable with RE's (see my answer), especially the more powerful implementation in Perl 5.10 so it will be a struggle to make the switch.

Adrian Pronk 2010-02-18 11:19:24

Answer 4

+1 A:

(This is in answer to the question @finnw asked in the comment under his answer. It's not so much an answer to the OP's question as an extended explanation of why a regex is the wrong tool.)

Here's my test code:

String r0 = "(?m)^((?:[^\"]|\"(?:[^\"]|\\\")*\")*)//.*$";
String r1 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n]|\\\")*\")*)//.*$";
String r2 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";

// Parentheses and (?m) are needed so that trailing spaces are stripped from
// every line, not just from the last string literal.
String test = (
    "class Test {                                 \n"+
    "  String s = \" ... \\\" // no comment \";   \n"+
    "  /*                                         \n"+
    "   * also no comment: // but no harm         \n"+
    "   */                                        \n"+
    "  /* no comment: // much harm  */            \n"+
    "  char quote = '\"';  // comment             \n"+
    "  // another comment                         \n"+
    "  int i = 0; // and another                  \n"+
    "}                                            \n"
    ).replaceAll("(?m) +$", "");
System.out.printf("%n%s%n", test);

System.out.printf("%n%s%n", test.replaceAll(r0, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r1, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r2, "$1"));

r0 is the edited regex from your answer; it removes only the final comment (// and another), because everything else is matched in group(1). Setting multiline mode ((?m)) is necessary for ^ and $ to work right, but it doesn't solve this problem because your character classes can still match newlines.

r1 deals with the newline problem, but it still incorrectly matches // no comment in the string literal, for two reasons: you didn't include a backslash in the first part of (?:[^\"\r\n]|\\\"); and you only used two of them to match the backslash in the second part.

r2 fixes that, but it makes no attempt to deal with the quote in the char literal, or single-line comments inside the multiline comments. They can probably be handled too, but this regex is already Baby Godzilla; do you really want to see it all grown up?

Alan Moore 2010-02-18 04:33:28

As usual, good stuff Alan.

Bart Kiers 2010-02-18 08:15:33

The question says nothing about multi-line comments, so I did not include them in my regex.

finnw 2010-02-18 12:39:24

You're right, the OP didn't say it was Java source code, just that there were comments and quoted strings--in fact, he didn't even mention escaped quotes. I ran with it anyway to demonstrate how quickly a pure-regex solution can turn into a quagmire as requirements creep. And the errors in your regex were pretty common ones, so it seemed worthwhile to dissect them.

Alan Moore 2010-02-18 13:09:08
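
The backslash-counting point in the answer above (r1 versus r2) comes down to two layers of escaping: the Java string literal and the regex itself. Here is a small sketch of the difference; the class name `BackslashLevels` and the field names `TWO` and `FOUR` are made up for this example.

```java
import java.util.regex.Pattern;

final class BackslashLevels {

    // Java source "\\\"" is the two-character regex \" which matches a lone
    // double quote; the backslash only escapes the quote inside the regex.
    static final Pattern TWO = Pattern.compile("\\\"");

    // Java source "\\\\\"" is the three-character regex \\" which matches a
    // literal backslash followed by a double quote.
    static final Pattern FOUR = Pattern.compile("\\\\\"");

    public static void main(String[] args) {
        String quoteOnly = "\"";        // the single character "
        String escapedQuote = "\\\"";   // the two characters \ and "
        System.out.println(TWO.matcher(quoteOnly).matches());
        System.out.println(TWO.matcher(escapedQuote).matches());
        System.out.println(FOUR.matcher(escapedQuote).matches());
    }
}
```

With only two backslashes in the Java source, the regex never sees a literal backslash at all, so an escaped quote inside a string literal is treated as the end of the string.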

Answer 5

+1 A:

The following is from a grep-like program I wrote (in Perl) a few years ago. It has an option to strip Java comments before processing the file:

# ============================================================================
# ============================================================================
#
# strip_java_comments
# -------------------
#
# Strip the comments from a Java-like file.  Multi-line comments are
# replaced with the equivalent number of blank lines so that all text
# left behind stays on the same line.
#
# Comments are replaced by at least one space .
#
# The text for an entire file is assumed to be in $_ and is returned
# in $_
#
# ============================================================================
# ============================================================================

sub strip_java_comments
{
      s!(  (?: \" [^\"\\]*   (?:  \\.  [^\"\\]* )*  \" )
         | (?: \' [^\'\\]*   (?:  \\.  [^\'\\]* )*  \' )
         | (?: \/\/  [^\n] *)
         | (?: \/\*  .*? \*\/)
       )
       !
         my $x = $1;
         my $first = substr($x, 0, 1);
         if ($first eq '/')
         {
             "\n" x ($x =~ tr/\n//);
         }
         else
         {
             $x;
         }
       !esxg;
}

This code does actually work properly and can't be fooled by tricky comment/quote combinations. It will probably be fooled by Unicode escapes (\u0022 etc.), but you can easily deal with those first if you want to.

As it's Perl, not Java, the replacement code will have to change. I'll have a quick crack at producing equivalent Java. Stand by...

EDIT: I've just whipped this up. Will probably need work:

// The trick is to search for both comments and quoted strings.
// That way we won't notice a (partial or full) comment within a quoted string
// or a (partial or full) quoted-string within a comment.
// (I may not have translated the back-slashes accurately.  You'll figure it out)

Pattern p = Pattern.compile(
       "(  (?: \" [^\"\\\\]*   (?:  \\\\.  [^\"\\\\]* )*  \" )" +  //    " ... "
       "  | (?: ' [^'\\\\]*    (?:  \\\\.  [^'\\\\]*  )*  '  )" +  // or ' ... '
       "  | (?: //  [^\\n] *    )" +                               // or // ...
       "  | (?: /\\*  .*? \\* / )" +                               // or /* ... */
       ")",
       Pattern.DOTALL  | Pattern.COMMENTS
);

Matcher m = p.matcher(entireInputFileAsAString);

StringBuilder output = new StringBuilder();

while (m.find())
{
    if (m.group(1).startsWith("/"))
    {
        // This is a comment. Replace it with a space...
        m.appendReplacement(output, " ");

        // ... or replace it with an equivalent number of newlines
        // (exercise for reader)
    }
    else
    {
        // We matched a quoted string.  Put it back
        m.appendReplacement(output, "$1");
    }
}

m.appendTail(output);
return output.toString();

Adrian Pronk 2010-02-18 09:58:22

Nice example Adrian.

Bart Kiers 2010-02-18 18:26:14
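
The "exercise for reader" in the Java version above (replacing a comment with an equivalent number of newlines) could be completed roughly as follows. This is a sketch under assumptions, not the answerer's code: the class name `CommentBlanker` and the method `blank` are invented, the pattern is written without `Pattern.COMMENTS` for brevity, and a comment that spans no line break is replaced by a single space.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommentBlanker {

    // Same idea as the answer: match literals and comments in one alternation,
    // so a comment marker inside a literal (or a quote inside a comment) is never seen on its own.
    private static final Pattern TOKEN = Pattern.compile(
          "\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\""   //    " ... "
        + "|'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'"      // or ' ... '
        + "|//[^\\n]*"                           // or // ...
        + "|/\\*.*?\\*/",                        // or /* ... */
        Pattern.DOTALL);

    static String blank(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuffer out = new StringBuffer();   // StringBuffer overload works on older Java versions too
        while (m.find()) {
            String token = m.group();
            String replacement;
            if (token.startsWith("/")) {
                // A comment: keep one newline per line break so line numbers stay stable.
                StringBuilder newlines = new StringBuilder();
                for (int i = 0; i < token.length(); i++) {
                    if (token.charAt(i) == '\n') newlines.append('\n');
                }
                replacement = newlines.length() == 0 ? " " : newlines.toString();
            } else {
                replacement = token;             // a string or char literal: put it back unchanged
            }
            // quoteReplacement keeps '$' and '\' in the literal from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(blank("int a = 1; /* x\n y */ int b; // z\nString s = \"// not\";"));
    }
}
```

Using `Matcher.quoteReplacement` avoids surprises when a literal contains `$` or `\`, which would otherwise be read as group references or escapes in the replacement string.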
