Prologue: Although the set of languages recognized by parsers (context-free grammars) is strictly larger than that recognized by scanners (regular grammars), most parser generators still need a scanner.

(Please don't try to explain the reasons behind it, I know them quite well).

I've seen parsers that do not require a scanner.

There are certain advantages to using no scanner:

  • Just one concept (context-free grammars) instead of two
  • Parse multiple languages in one file (like HTML/JavaScript)
  • Parse languages without reserved keywords (like PL/I)

Often, "workarounds" are used like switching the scanner on the parser's request.

Question: Do you know of any other scannerless parser generators (in any language)? Are they practical in use (or purely academic)? Are there approaches other than Tomita/GLR?

Answers:

+1  A: 

boost::spirit::qi doesn't need a lexer, although you can use boost::spirit::lex as a front end. From the introduction to boost::spirit::lex:

In fact, Spirit.Qi allows you to write parsers without using a lexer, parsing the input character stream directly, and for the most part this is the way Spirit has been used since its invention.

boost::spirit (the original one) actually worked exactly as you want, but it has been moved to boost::spirit::classic. spirit::qi and spirit::lex are the new design of Spirit, so I left the classic version out :)
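
As a minimal sketch of what "parsing the input character stream directly" looks like with Spirit.Qi (the tiny expression grammar here is mine, purely illustrative; assumes Boost is installed):

    #include <boost/spirit/include/qi.hpp>
    #include <iostream>
    #include <string>

    namespace qi = boost::spirit::qi;

    int main() {
        std::string input = "1 + 2 * 3";
        auto it = input.begin();
        // The grammar is defined directly over characters: qi::int_ and the
        // '+'/'*' literals consume raw input, and the qi::space skipper handles
        // whitespace where a traditional design would use a separate lexer.
        bool ok = qi::phrase_parse(
            it, input.end(),
            qi::int_ >> *(('+' >> qi::int_) | ('*' >> qi::int_)),
            qi::space);
        std::cout << (ok && it == input.end() ? "parsed" : "failed") << "\n";
    }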

AraK
+3  A: 

Parser generators don't need a scanner. But you are pretty much crazy if you don't use one.

Parsers built by parser generators don't care what you feed them, as long as you call them tokens.

To use a parser generator without a scanner, simply define your grammar down to the character level and feed individual characters to the parser as tokens.
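
For illustration, here is a hand-written sketch (mine, not generator output) of what "grammar down to the character level" means; every token the parser consumes is a single character:

    #include <cctype>
    #include <string>

    // Illustrative character-level grammar, hand-coded as recursive descent:
    //   number ::= digit { digit }
    //   digit  ::= '0' | '1' | ... | '9'
    struct CharLevelParser {
        const std::string& s;
        std::size_t pos = 0;

        bool digit() {   // terminal rule over a single-character "token"
            if (pos < s.size() && std::isdigit((unsigned char)s[pos])) { ++pos; return true; }
            return false;
        }
        bool number() {  // nonterminal built from character terminals
            if (!digit()) return false;
            while (digit()) {}
            return true;
        }
    };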

The reason this is crazy is that parsing is a more complex activity than lexing. You can build lexers as finite state machines, which translate to machine code pretty much as "compare and jump to next state". For speed, that's really hard to beat. Parser generators construct parsers that do recursive-descent predictive parsing (most LL generators, such as ANTLR) or do table lookups by hashing, binary or linear search, etc. So a parser spends a lot more energy on a token than a lexer spends on a character.
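
For comparison, a minimal sketch (my own, illustrative) of a lexer as a table-driven finite state machine; per input character, the cost is one table lookup and a jump to the next state:

    #include <array>
    #include <string>

    enum State { Start, InIdent, InNumber, Error };

    // next_state[state][character class]; classes: 0 = letter, 1 = digit, 2 = other
    constexpr std::array<std::array<State, 3>, 4> next_state{{
        {InIdent, InNumber, Error},   // Start
        {InIdent, InIdent,  Error},   // InIdent: letters and digits continue
        {Error,   InNumber, Error},   // InNumber: only digits continue
        {Error,   Error,    Error},   // Error is a trap state
    }};

    int char_class(char c) {
        if (c >= 'a' && c <= 'z') return 0;
        if (c >= '0' && c <= '9') return 1;
        return 2;
    }

    State scan(const std::string& input) {
        State s = Start;
        for (char c : input)
            s = next_state[s][char_class(c)];  // one lookup, one jump per character
        return s;
    }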

If you feed characters to a parser as tokens, it will spend correspondingly more energy on each character than an equivalent lexer would. If you process a lot of input text, this will eventually matter, whether you do it for zillions of small input streams or a few really large ones.

The so-called scannerless GLR parsers suffer from this performance problem, relative to GLR parsers that are designed to use tokens.

My company builds a tool, the DMS Software Reengineering Toolkit, which uses a GLR parser (and successfully parses all the common languages you know and a lot of odder ones you don't, because it has a GLR parser). We knew about scannerless parsers and chose not to implement them because of the speed differential; we have a classically styled (but extremely powerful) LEX-like subsystem for defining lexical tokens. In the one case where DMS went nose-to-nose against a tool based on XT (a toolset with a scannerless GLR parser) processing the same input, DMS appeared to be 10x as fast as the XT package. To be fair, the experiment was ad hoc and not repeated, but as it matched my suspicions I saw no reason to repeat it. YMMV. And of course, if we want to go scannerless, well, it's pretty easy to write a grammar with character terminals, as I have already pointed out.

Scannerless GLR parsers do have another very nice property that won't matter to most people: you can take two separate grammars for a scannerless parser, literally concatenate them, and still get a parser (often with a lot of ambiguities). This matters a lot when you are building one language embedded inside another. If that's not what you are doing, it's just an academic curiosity.

And, AFAIK, Elkhound isn't scannerless. (I could be wrong on this.) (EDIT 2/10: Looks like I was wrong. Won't be the first time in my life :)

Ira Baxter
Thanks for your answer, especially for reporting your experience from industry. Note that linear speed differences are not a major handicap; otherwise nobody would have written scanners in scripting languages. But GLR can show non-linear performance (O(n³)) on grammars that require unbounded lookahead, like identifier tokens. If parsers could handle tokens with linear complexity (and there are concepts to do this), why bother to write a scanner?
Meinersbur
Scannerless Elkhound: http://scottmcpeak.com/elkhound/sources/elkhound/examples/scannerless/
Meinersbur
@Meinersbur: I don't understand your point. You seem to be saying that scannerless parsers go nonlinear when trying to handle identifiers; that would suggest you agree that using a scannerless parser isn't a good idea. But then you say, "why bother to write a scanner?". What did I miss?
Ira Baxter
I agree (to some extent) that scannerless parsing with GLR is not the best idea. But what about the other concepts (noncanonical parsing, LR-regular grammars, ELR parsing, ...)? Is there an implementation which uses these instead, with linear complexity?
Meinersbur
If there is real ambiguity, or local ambiguity requiring long-distance lookahead to resolve, in the grammar (induced by character-rules-for-terminals or by more traditional nonterminals), I don't think you should expect a guarantee of linear time. Of course, you might be able to code your grammar to ensure there are no ambiguities. In that case, I don't know what is possible, but the very act of shuffling your grammar about to achieve such a property would make the parsing technology that much less interesting. GLR is expensive in theory; in practice, it works well.
Ira Baxter
+2  A: 

Two others:

  • Bryan Ford's Parsing Expression Grammars (PEGs) require no scanner. An efficient, lazy "packrat" parser is optional (a sketch of the packrat idea follows this list). I've had nothing but good experience with the Lua LPEG version, which compiles to an efficient bytecode machine. Quite practical.

  • YAKKER looks very intriguing, although it is still clearly in a pre-release state. They are using what they claim to be an efficient variation on Earley's parsing algorithm.
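
For readers new to the idea, here is a compact sketch of the packrat trick (my own illustration, not LPEG's API): a PEG with ordered choice plus a memo table, so each (rule, position) pair is parsed at most once, which is what buys the linear running time:

    #include <map>
    #include <optional>
    #include <string>
    #include <utility>

    struct Packrat {
        std::string s;
        // memo[(rule id, position)] = end position on success, nullopt on failure
        std::map<std::pair<int, std::size_t>, std::optional<std::size_t>> memo;

        // Rule 0: number <- [0-9]+
        std::optional<std::size_t> number(std::size_t pos) {
            auto key = std::make_pair(0, pos);
            if (auto it = memo.find(key); it != memo.end()) return it->second;
            std::size_t p = pos;
            while (p < s.size() && s[p] >= '0' && s[p] <= '9') ++p;
            std::optional<std::size_t> r =
                (p > pos) ? std::optional<std::size_t>(p) : std::nullopt;
            return memo[key] = r;
        }

        // Rule 1: value <- number / '(' value ')'   (ordered choice)
        std::optional<std::size_t> value(std::size_t pos) {
            auto key = std::make_pair(1, pos);
            if (auto it = memo.find(key); it != memo.end()) return it->second;
            std::optional<std::size_t> r = number(pos);
            if (!r && pos < s.size() && s[pos] == '(') {
                if (auto mid = value(pos + 1); mid && *mid < s.size() && s[*mid] == ')')
                    r = *mid + 1;
            }
            return memo[key] = r;
        }
    };

Without the memo table this is just recursive descent with backtracking; the memoization is the part that makes it "packrat".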

I'm actually a huge fan of scannerless parsers; they simplify configuration enormously. And typical scanner generators, to put it mildly, are not much fun to use. From the man page for Lex:

The asteroid to kill this dinosaur is still in orbit.

Finally, I have no personal experience with Elkhound, but the secondhand reports I hear are impressive. I would say there's no question but that some scannerless parser generators are very practical.

Norman Ramsey
+1  A: 

Sorry for the necropost. You can try this out - an embeddable PEG/Packrat implementation for .NET:

http://www.meta-alternative.net/pfdoc.pdf

http://www.meta-alternative.net/mbase.html

SK-logic
No worries - I'm still happy about every answer
Meinersbur