To build a general-purpose documentation system that can extract inline documentation from multiple languages, a parser is needed for each language. A parser generator (which doesn't have to be especially complete or efficient) is therefore needed.

http://antlr.org/ is a nice parser generator that already has grammars for a number of popular languages. Are there better alternatives, i.e. simpler ones that support generating parsers for even more languages out of the box?

A: 

Where I work we used to use GOLD Parser. It is a lot simpler than ANTLR and supports multiple languages. We have since moved to ANTLR, however, because we needed to do more complex parsing, which we found ANTLR handled better than GOLD.

adrianbanks
GOLD, AFAIK, is a pure LALR(1) parser generator, i.e., it is like Bison and YACC. The downside of such parser generators is that virtually no real programming language has a natural LALR(1) grammar, so immense amounts of energy are needed to bend and twist the grammar to fit an LALR(1) parser generator, GOLD included. LALR(1) parser generators are ideal only for domain-specific languages that are *designed* to have LALR(1) grammars.
Ira Baxter
A: 

See the answers to the SO question "Source of Parsers for Programming Languages".

Ira Baxter
A: 

If you're only looking for "partial parsing", you could use ANTLR's option to lex only part of the input and ignore everything else. You do that by setting filter=true in the options of a lexer grammar. The lexer then tries to match any token you defined in your grammar; when it can't match one at the current position, it skips a single character and tries again to match one of your tokens at the next character:

lexer grammar Foo;

options {filter=true;}

StringLiteral
  :  ...
  ;

CharLiteral
  :  ...
  ;

SingleLineComment
  :  ...
  ;

MultiLineComment
  :  ...
  ;

When implemented properly, you can get the MultiLineComments (/* ... */) from a Java file quite easily, without worrying about single-line comments and string or char literals messing things up.

Obviously, your source files need to be valid for the tokenizer to work properly; otherwise you get strange results!
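To make the "match a known token or skip one character" behaviour concrete, here is a minimal Python sketch of the same idea using a regular expression rather than ANTLR. It is not the ANTLR-generated lexer: the token patterns are simplified assumptions for Java-like source (no Unicode escapes, no text blocks), and the names TOKEN and block_comments are made up for this illustration.

```python
import re

# One alternative per "lexer rule". At each position the regex engine tries
# them in order; if none matches, finditer() moves on one character, which
# mimics the skip-a-character behaviour of a filtering lexer.
# The patterns are simplified assumptions, not a full Java lexer.
TOKEN = re.compile(
    r"""
      (?P<string>"(?:\\.|[^"\\\n])*")      # string literal with escapes
    | (?P<char>'(?:\\.|[^'\\\n])*')        # char literal with escapes
    | (?P<line_comment>//[^\n]*)           # single-line comment
    | (?P<block_comment>/\*.*?\*/)         # multi-line comment (non-greedy)
    """,
    re.DOTALL | re.VERBOSE,
)


def block_comments(source):
    """Return the text of every /* ... */ comment in source.

    Strings, char literals and single-line comments are matched too, but only
    so that a '/*' inside them is consumed and cannot start a false comment.
    """
    return [
        m.group("block_comment")
        for m in TOKEN.finditer(source)
        if m.lastgroup == "block_comment"
    ]


if __name__ == "__main__":
    sample = (
        'String s = "/* not a comment */"; // also /* not */\n'
        "/* real\n comment */ char c = '\"'; /* second */"
    )
    for comment in block_comments(sample):
        print(repr(comment))
```

As with the lexer grammar, this only behaves sensibly on valid source: an unterminated string or comment makes the scanner resynchronise at an arbitrary character.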

Bart Kiers