I am learning Parser Combinators in scala and seeing different ways of parsing.I mainly see three different kind of parsers ie.RegexpParsers,StandardTokenParsers and JavaTokenParsers.I am new to parsing and not getting the idea how we will choose the suitable Parser according to our requirement.Can any one please explain how these different parsers work and when to use them.
views:
191answers:
2RegexpParsers
allow you to use RE values (typically in the form "re pattern".r
but equally any other Regex instance). There are no pre-defined lexical productions (tokens).
JavaTokenParsers
defines lexical productions for Java tokens: decimalNumber
, floatingPointNumber
, stringLiteral
, wholeNumber
, ident
(identifier).
StandardTokenParsers
defines lexical productions "... for a simple, Scala-like language. It parses keywords and identifiers, numeric literals (integers), strings, and delimiters." Its constituents are actually defined in StdLexical
.
There are several different parser traits and base classes for different purposes.
The main trait is scala.util.parsing.combinator.Parsers
. This has most of the main combinators like opt
, rep
, elem
, accept
, etc. Definitely look over the documentation for this one, since this is most of what you need to know. The actual Parser
class is defined as an inner class here, and that's important to know about, too.
Another important trait is scala.util.parsing.combinator.lexical.Scanners
. This is the base trait for parsers which read a stream of characters and produce a stream of tokens (also known as lexers). In order to implement this trait, you need to implement a whitespace
parser, which reads whitespace characters, comments, etc. You also need to implement a token
method, which reads the next token. Tokens can be whatever you want, but they must be a subclass of Scanners.Token
. Lexical
extends Scanners
and StdLexical
extends Lexical
. The former provides some useful basic operations (like digit
, letter
), while the latter actually defines and lexes common tokens (like numeric literals, identifiers, strings, reserved words). You just have to define delimiters
and reserved
, and you will get something useful for most languages. The token definitions are in scala.util.parsing.combinator.token.StdTokens
.
Once you have a lexer, you can define a parser which reads a stream of tokens (produced by the lexer) and generates an abstract syntax tree. Separating the lexer and parser is a good idea since you won't need to worry about whitespace or comments or other complications in your syntax. If you use StdLexical
, you may consider using scala.util.parsing.combinator.syntax.StdTokenPasers
which has parsers built in to translate tokens into values (e.g., StringLit
into String
). I'm not sure what the difference is with StandardTokenParsers
. If you define your own token classes, you should just use Parsers
for simplicity.
You specifically asked about RegexParsers
and JavaTokenParsers
. RegexParsers
is a trait which extends Parsers
with one additional combinator: regex
, which does exactly what you would expect. Mix in RegexParsers
to your lexer if you want to use regular expressions to match tokens. JavaTokenParsers
provides some parsers which lex tokens from Java syntax (like identifiers, integers) but without the token baggage of Lexical
or StdLexical
.
To summarise, you probably want two parsers: one which reads characters and produces tokens, and one which takes tokens and produces an AST. Use something based on Lexical
or StdLexical
for the first. Use something based on Parsers
or StdTokenParsers
for the second depending on whether you use StdLexical
.