I'm attempting to parse one particular (home-grown) JavaDoc tag in my JavaScript file and I'm struggling to understand how to achieve this. ANTLR complains as documented below:

jsDocComment 
    : '/**' (importJsDocCommand | ~('*/'))* '*/' <== See note 1
    ;

importJsDocCommand
    : '@import' gav
    ;

gav
    :  gavGroup ':' gavArtifact
    -> ^(IMPORT gavGroup gavArtifact)
    ;

gavGroup 
    : gavIdentifier
    ;

gavArtifact
    : gavIdentifier
    ;

gavIdentifier 
    : ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'.')* <== See note 2
    ;
  • Note 1: The following alternatives can never be matched: 1

  • Note 2: Decision can match input such as "'_'..'.'" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input

Here's what I'm trying to parse:

/** a */
/** @something */
/** @import com.jquery:jquery */

All lines should parse OK, with only the @import statement (along with its Maven group:artifact value) placed under an AST node named "IMPORT".

Thanks for your assistance.

A: 

Christopher Hunt wrote:

  • Note 1: The following alternatives can never be matched: 1

~('*/') is incorrect: you can only negate single characters in lexer rules (!). In your snippet, you're trying to negate something in a parser rule. In parser rules, you're not negating character(s), but tokens. For example:

parse : ~A;
foo   : .;
A     : 'A';
B     : 'B';
C     : 'C';

Here the parse rule does not match "any character except 'A'"; it matches any single token other than A, that is, either B or C. Likewise, foo does not match any character; it matches any single token (anything produced by a lexer rule).

Christopher Hunt wrote:

  • Note 2: Decision can match input such as "'_'..'.'" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input

Two questions:

  1. did you post the entire grammar?
  2. are you trying to parse the entire JS file or are you just "filtering" JS files and pulling out the JavaDoc comments?

If it's the latter, there is a much easier way to do this using ANTLR (and I can give an explanation if this is the case).

EDIT

It's easiest to just add a new DocComment rule to the lexer and to place it just above the (existing) Comment rule:

DocComment
  :  '/**' (options {greedy=false;} : .)* '*/'
  ;

Comment
  :  '/*' (options {greedy=false;} : .)* '*/' {$channel=HIDDEN;}
  ;
Bart Kiers
@Bart Thanks for the clarification re. characters vs tokens - that makes a lot of sense to me now! I didn't post the entire grammar - sorry. I'm adding to the ECMAScript grammar (http://www.antlr.org/grammar/1206736738015/JavaScript.g). Thanks for your great help.
Christopher Hunt
@Christopher, no problem. Do you still have a question or problem? If so, could you then post your modified grammar and indicate what the problem is?
Bart Kiers
@Bart, I really appreciate the feedback, but if I define DocComment then how do I parse out the @import declarations therein contained? Sorry, very new to ANTLR and appreciate the guidance.
Christopher Hunt
@Christopher, by pulling jsDocComment up to the parser, you're making your entire grammar more difficult. One example: `*` is no longer only the multiplication token, but is now also a token inside your JavaDoc comments. Also, keywords of the normal JS grammar can occur inside a comment without having any special meaning. The easiest approach is to match the comment in the lexer and then parse that DocComment separately (you could even create a separate grammar just for parsing it).
Bart Kiers
@Christopher, also look at the following ANTLR [wiki page](http://www.antlr.org/wiki/display/ANTLR3/Island+Grammars+Under+Parser+Control) about this type of stuff.
Bart Kiers
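The suggestion in the comments above (match the whole comment as one DocComment token, then parse its text separately) can be sketched in plain Java. This is only an illustration under stated assumptions: the class name DocCommentImports and the method findImports are hypothetical, and a regular expression stands in for the separate ANTLR grammar that was suggested. The identifier pattern mirrors the GAVID rule from the thread (a letter or underscore, followed by letters, digits, '_', '-' or '.').

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts "@import group:artifact" pairs from the
// text of a doc comment that the lexer has already matched as one token.
public class DocCommentImports {

    // Assumed identifier shape: letter or '_' first, then letters, digits,
    // '_', '-' or '.'.
    private static final String GAVID = "[A-Za-z_][A-Za-z0-9_.\\-]*";

    private static final Pattern IMPORT =
        Pattern.compile("@import\\s+(" + GAVID + "):(" + GAVID + ")");

    // Returns every import found in the comment text as "group:artifact".
    public static List<String> findImports(String docComment) {
        List<String> imports = new ArrayList<>();
        Matcher m = IMPORT.matcher(docComment);
        while (m.find()) {
            imports.add(m.group(1) + ":" + m.group(2));
        }
        return imports;
    }

    public static void main(String[] args) {
        // The three sample comments from the question.
        System.out.println(findImports("/** a */"));
        System.out.println(findImports("/** @something */"));
        System.out.println(findImports("/** @import com.jquery:jquery */"));
    }
}
```

With ANTLR 3 and the Java target, the string passed to findImports would typically be the text of the DocComment token (for example via Token.getText()); how the tokens are collected is left out here.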
A: 

My solution to this problem was to use ANTLR's Lexer without the parser and filter out stuff that I'm not interested in. Here's what I came up with (it also looks for globally defined variables as well as imports):

lexer grammar ECMAScriptLexer;

options {filter=true;}

@lexer::header {
    package com.classactionpl.mojo.javascript;
}

@members {
    int scopeLevel = 0;
}

IMPORTDOC
    :   '/**' .* IMPORT .* (IMPORT)* '*/'
    ;

fragment 
IMPORT
    :   '@import' WS groupId=GAVID ':' artifactId=GAVID
        {System.out.println("found import: " + $groupId.text + ":" + $artifactId.text);}
    ;

fragment
GAVID  
    :   ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'_'|'-'|'0'..'9'|'.')*
    ;

COMMENT
    :   '/*' .* '*/'
    ;

SL_COMMENT
    :   '//' .* '\n' 
    ;

ENTER_SCOPE
    :   '{' {++scopeLevel;}
    ;

EXIT_SCOPE
    :   '}' {--scopeLevel;}
    ;

WINDOW_VAR
    :   'window.' name=ID WS? value=(';' | '=') ~('=')
        {
            System.out.println("found window var " + $name.text + " = " + ($value == ';'));
        }
    ;

GLOBAL_VAR
    :   'var' WS name=ID WS? value=(';' | '=') ~('=')
        {
            if (scopeLevel == 0) {
                System.out.println("found global var " + $name.text + " = " + ($value == ';'));
            }
        }
    ;

fragment
ID  :   ('a'..'z'|'A'..'Z'|'$'|'_') ('a'..'z'|'A'..'Z'|'$'|'_'|'0'..'9')*
    ;

fragment
WS  :   (' '|'\t'|'\n')+
    ;
Christopher Hunt
@Christopher, shouldn't you account for string literals? Your lexer will produce invalid tokens when it stumbles upon `var s = " { ";`, because the `{` inside the string increases the `scopeLevel`.
Bart Kiers
ah ha - yes, good point. I'd better do that. :-) However my main point is the use of the lexer and not the parser. Thanks though.
Christopher Hunt
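The string-literal point raised in the comments can be illustrated outside ANTLR with a small Java sketch. It is not the lexer grammar itself, only the idea behind the fix: skip over quoted strings before counting braces. The class name ScopeCounter and its methods are hypothetical, and the sketch deliberately ignores comments and regular-expression literals, which a real JavaScript scanner would also need to skip.

```java
// Hypothetical illustration: count brace depth in JavaScript source while
// ignoring braces that appear inside single- or double-quoted strings.
public class ScopeCounter {

    // Returns the brace nesting level reached at the end of the source.
    public static int scopeLevel(String source) {
        int level = 0;
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '"' || c == '\'') {
                // Jump past the whole string literal, braces included.
                i = skipString(source, i);
                continue;
            }
            if (c == '{') {
                level++;
            } else if (c == '}') {
                level--;
            }
            i++;
        }
        return level;
    }

    // Returns the index just past the closing quote of the string that
    // starts at 'start', or source.length() if it is unterminated.
    private static int skipString(String source, int start) {
        char quote = source.charAt(start);
        int i = start + 1;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\\') {
                i += 2; // skip the escaped character
                continue;
            }
            if (c == quote) {
                return i + 1;
            }
            i++;
        }
        return source.length();
    }

    public static void main(String[] args) {
        // The brace inside the string must not change the scope level.
        System.out.println(scopeLevel("var s = \" { \";"));
    }
}
```

In the lexer grammar, the equivalent would presumably be a string-literal rule placed so that it consumes the whole literal before ENTER_SCOPE or EXIT_SCOPE can see the braces inside it.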