So many programming languages support the inclusion of mini-languages. PHP is embedded within HTML. XML can be embedded within JavaScript (via E4X). LINQ can be embedded within C#. Regular expressions can be embedded in Perl.

// JavaScript (E4X) example: an XML literal embedded directly in the code
var a = <node><child/></node>;

Come to think of it, most programming languages can be modeled as collections of different mini-languages. Java, for example, could be broken into at least four distinct mini-languages:

  • A type-declaration language (package directive, import directives, class declaration)
  • A member-declaration language (access modifiers, method declarations, member vars)
  • A statement language (control flow, sequential execution)
  • An expression language (literals, assignments, comparisons, arithmetic)

Being able to implement those four conceptual languages as four distinct grammars would certainly cut down on a lot of the spaghettiism that I usually see in complex parser and compiler implementations.

I've implemented parsers for various kinds of languages before (using ANTLR, JavaCC, and custom recursive-descent parsers), and when the language gets really big and complex, you usually end up with one huuuuuuge grammar, and the parser implementation gets really ugly really fast.

Ideally, when writing a parser for one of those languages, it'd be nice to implement it as a collection of composable parsers, passing control back and forth between them.
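
To make that concrete, here's a minimal sketch of the kind of hand-off I'm imagining (Java, all names mine; a toy, not a real implementation):

interface MiniParser {
    // Parse a fragment starting at pos; return the index just past it
    // so the caller can resume there.
    int parse(String input, int pos);
}

class ExpressionParser implements MiniParser {
    public int parse(String input, int pos) {
        // Toy expression language: just consume a run of digits.
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
        return pos;
    }
}

class StatementParser implements MiniParser {
    private final MiniParser expr = new ExpressionParser();
    public int parse(String input, int pos) {
        // A "statement" is EXPR ';' -- control passes down to the
        // expression parser and comes back here for the terminator.
        pos = expr.parse(input, pos);
        if (pos < input.length() && input.charAt(pos) == ';') pos++;
        return pos;
    }
}

Each parser only knows its own grammar; the composition point is just "parse from here, and tell me where you stopped."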

The tricky thing is that often the containing language (e.g., Perl) defines its own terminus sentinel for the contained language (e.g., regular expressions). Here's a good example:

my $result =~ m|abc.*xyz|i;

In this code, the main Perl code defines a nonstandard terminus "|" for the regular expression. Implementing the regex parser as completely distinct from the Perl parser would be really hard, because the regex parser wouldn't know how to find the expression terminus without consulting the parent parser.
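
One possible shape for a fix (just a sketch; RegexSubParser and PerlMatchParser are made-up names): the parent parser reads the user-chosen delimiter itself and passes it down to the regex sub-parser as an explicit stop sentinel.

class RegexSubParser {
    // Consume a regex body until the delimiter chosen by the parent.
    int parse(String input, int pos, char delimiter) {
        while (pos < input.length() && input.charAt(pos) != delimiter) {
            if (input.charAt(pos) == '\\') pos++; // skip an escaped character
            pos++;
        }
        return pos; // stop at the delimiter; the parent consumes it
    }
}

class PerlMatchParser {
    private final RegexSubParser regex = new RegexSubParser();
    int parseMatch(String input, int pos) {
        // At "m", the next character is the user-chosen delimiter, e.g. '|'.
        char delim = input.charAt(pos + 1);
        int end = regex.parse(input, pos + 2, delim);
        return end + 1; // past the closing delimiter; flags handled elsewhere
    }
}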

Or, let's say I had a language which allowed the inclusion of LINQ expressions, but instead of terminating with a semicolon (as C# does), I wanted to mandate that LINQ expressions appear within square brackets:

var linq_expression = [from n in numbers where n < 5 select n]

If I defined the LINQ grammar within the parent language grammar, I could easily write an unambiguous production for a "LinqExpression" using syntactic lookahead to find the bracket enclosures. But then my parent grammar would have to absorb the whole LINQ specification. And that's a drag. On the other hand, a separate child LINQ parser would have a very difficult time figuring out where to stop, because it would need to implement lookahead for foreign token types.

And that would pretty much rule out using separate lexing/parsing phases, since the LINQ parser would define a whole different set of tokenization rules than the parent parser. If you're scanning one token at a time, how do you know when to pass control back to the lexical analyzer of the parent language?
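
The best workaround I can think of (a sketch, under the assumption that the embedded language never contains an unbalanced ']'): don't tokenize the embedded region with the parent lexer at all. Find its extent by bracket counting, hand the raw slice to the child's own lexer/parser, and resume after the close bracket.

class EmbeddedRegion {
    // Return the index of the ']' that balances the '[' at pos.
    static int findBalancedClose(String input, int pos) {
        int depth = 0;
        for (int i = pos; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '[') depth++;
            else if (c == ']' && --depth == 0) return i;
        }
        throw new IllegalStateException("unbalanced brackets");
    }
}

The parent sees '[', slices out input.substring(open + 1, close), runs the LINQ lexer and parser on just that slice, and resumes past close.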

What do you guys think? What are the best techniques available today for implementing distinct, decoupled, and composable language grammars for the inclusion of mini-languages within larger parent languages?

+1  A: 

Parsing is one aspect of the problem, but I suspect that the interoperation between the various executable interpreters behind each mini-language is probably significantly harder to solve. In order to be useful, each independent syntax block has to work consistently with the overall context (or the final behavior will be unpredictable, and thus unusable).

Not that I understand what they are really doing, but a very interesting place to look for more inspiration is FoNC (the Fundamentals of New Computing project at Viewpoints Research). They seem (I am guessing) to be headed in a direction that allows all sorts of different computational engines to interact seamlessly.

Paul W Homer
+2  A: 

I am working on this exact problem. I'll share my thoughts:

Grammars are difficult to debug. I've debugged a few in Bison and ANTLR and it wasn't pretty. If you want the user to insert DSL grammars into your parser, then you've got to find some way to keep them from blowing it up. My approach is to not allow arbitrary DSLs, but to only allow those that follow two rules:

  • The token types (identifiers, strings, numbers) are the same between all DSLs in a file.
  • Unbalanced parentheses, braces, or brackets are not permitted.

The reason for the first restriction is because modern parsers break parsing into a lexical stage and then apply your traditional grammar rules. Fortunately, I believe that a single universal tokenizer is good enough for 90% of the DSLs you want to create, even if it doesn't accommodate the DSLs you already created that you want to embed.
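
Roughly what I mean by a universal tokenizer, as a sketch (the token classes here are illustrative, not a spec):

import java.util.*;
import java.util.regex.*;

class UniversalLexer {
    // One fixed set of token classes shared by every DSL in the file.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<ID>[A-Za-z_][A-Za-z0-9_]*)"
        + "|(?<NUM>\\d+(\\.\\d+)?)"
        + "|(?<STR>\"(\\\\.|[^\"\\\\])*\")"
        + "|(?<PUNCT>[()\\[\\]{};,.<>=+\\-*/|])"
        + "|(?<WS>\\s+)");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length() && m.find(pos) && m.start() == pos) {
            if (m.group("WS") == null) tokens.add(m.group()); // drop whitespace
            pos = m.end();
        }
        if (pos < input.length())
            throw new IllegalArgumentException("unrecognized character at " + pos);
        return tokens;
    }
}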

The second restriction allows the grammars to be more separated from each other. You can parse in two stages by grouping parentheses (braces, brackets) and then recursively parsing each group. The grammar of your embedded DSL can't escape through the brackets it's contained in.
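
The grouping stage can be as simple as this sketch (real code would also check that a '[' is closed by ']' and not '}'):

import java.util.*;

class Group {
    // Each item is either a String token or a nested Group.
    final List<Object> items = new ArrayList<>();
}

class Grouper {
    static Group group(Iterator<String> tokens) {
        Group g = new Group();
        while (tokens.hasNext()) {
            String t = tokens.next();
            if (t.equals("(") || t.equals("[") || t.equals("{")) {
                g.items.add(group(tokens));  // recurse into the nested group
            } else if (t.equals(")") || t.equals("]") || t.equals("}")) {
                return g;                    // a closer ends this group
            } else {
                g.items.add(t);
            }
        }
        return g;
    }
}

Each grammar then parses its own Group, and physically cannot read tokens outside it.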

Another part of the solution is to allow macros. For example, regex("abc*/[^.]") looks fine to me. This way the macro "regex" can parse the regex instead of building a regex grammar into the main language. You can't use different delimiters for your regex, sure, but you do gain a measure of consistency in my mind.
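
Sketched out, the macro mechanism is just a dispatch table from macro names to handlers that own their own mini-grammars (names are mine):

import java.util.*;
import java.util.function.Function;

class MacroHost {
    // Each macro name maps to a handler that parses its own mini-language.
    private final Map<String, Function<String, Object>> macros = new HashMap<>();

    void register(String name, Function<String, Object> handler) {
        macros.put(name, handler);
    }

    Object expand(String name, String rawBody) {
        Function<String, Object> h = macros.get(name);
        if (h == null) throw new IllegalArgumentException("unknown macro: " + name);
        return h.apply(rawBody);
    }
}

// e.g. host.register("regex", java.util.regex.Pattern::compile);
//      Object compiled = host.expand("regex", "abc*/[^.]");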

Dietrich Epp
A: 

If you think about it, this is really how recursive descent parsing works. Each rule and all the rules it depends on form a mini grammar. Anything higher up does not matter. You could, for example, write a Java grammar with ANTLR and separate all the different "mini-languages" into different parts of the file.

This isn't very common simply because these "mini-languages" will often share many rules. However, it definitely would be nice if tools like ANTLR allowed you to include separate grammars from different files. This would allow you to logically separate them. It probably isn't implemented because it's a "cosmetic" issue: it's purely about the grammar files themselves, not about parsing itself. It also won't make your code any shorter (although it may be slightly easier to follow). The only technical issue this would solve is name collisions.

Zifre
The problem is the terminals/"tokens". Defining a non-left-recursive grammar for all of those regular expression "tokens" quickly becomes unmanageable.
Paul Hollingsworth
+2  A: 

You might want to listen to this podcast. Scannerless parsing was "invented" to help solve the problem of composing different grammars (the problem being that you quickly find that you can't write a "universal" tokenizer/scanner).

Paul Hollingsworth
+1  A: 

Have a look at SGLR, Scannerless Generalized LR parsing. Here are some references and URLs. This parsing technique makes composition of parse tables very simple, especially in combination with SDF.

Martin Bravenboer and Eelco Visser. Designing Syntax Embeddings and Assimilations for Language Libraries. In Models in Software Engineering: Workshops and Symposia at MoDELS 2007, volume 5002 of LNCS, 2008.

MetaBorg and MetaBorg in action

Zef Hemel