views:

340

answers:

2

I have an idea for a hobby project which performs some code analysis and manipulation. This project will require both the concrete and abstract syntax trees of a given source file. Additionally, bi-directional references between the two trees would be helpful. I would like to avoid the work of transcribing a grammar to construct my own lexer and parser.

Is there a standard format for describing either concrete or abstract syntax trees? Do any widely-used tool chains support outputting to these formats?

I don't have a particular target programming language in mind. Any popular one will do for a prototype, but I'd prefer one I know well: Python, C#, Javascript, or C/C++.

I'd like the ability to run a source file through a tool or library and get back both trees. In an ideal world, it would be practical to run this tool on code as it is being edited by a user and be tolerant of errors. Again, I am simply trying to develop a prototype, so these requirements are pretty lax.

Thanks!

A: 

In our project we defined the AST metamodel in UML and use ANTLR (Java) to populate the model. We also maintain the token information from ANTLR after parsing, but we have not yet tried to update the underlying text-file with modifications made on the model.

This has a hideous overhead (in infrastructure, such as Eclipse UML2/EMF), but our goal is to use high-level tools for Model-based/driven Development (MDD, MDA) anyway, so we decided to use it on each level.

I think one of our students once played with OpenArchitectureWare and managed to get changes from the Eclipse-based, generated editor back into the syntax tree (not related to the UML model above) automatically, but I don't know the details about this.

You might also want to look at ANTLR's tree grammars.

ShiDoiSi
ANTLR looks promising! The "Grammar List" <http://www.antlr.org/grammar/list> seems like a great starting point. I'll look deeper tomorrow. My goal is the tree data structures, I'd assume from the runtimes <http://www.antlr.org/wiki/display/ANTLR3/Runtime+libraries>.
Brandon Bloom
A: 

The research community decided that graph exchange was the right thing to do when moving information from one program analysis tool to another. See http://www.gupro.de/GXL

More recently, the OMG has defined a standard for interchanging Abstract Syntax Trees. See http://www.omg.org/spec/ASTM/1.0/Beta1/

This problem seems to get solved over and over again. There's half a dozen "tool bus" proposals made over the years that all solved it, with no one ever overtaking the industry. The problem is that a) it is easy to represent ASTs using any kind of nestable notation [parentheses like LISP, like XML, ...] so people roll their own solution easily, and b) for one tool to exchange an AST with another, they both have to agree essentially on what the AST nodes mean; but most ASTs are rather accidentally derived from the particular grammar/parsing technology used by each tool, and there's almost always disagreement about that between tools. So, I've seen very few tools that exchange ASTs meaningfully.

If you're doing a hobby thing, I'd stick with a lisp-like encoding of trees, where each node has the following format: ( ... ) Its easy to generate, and easy to read.

I work on a professional tool to manipulate programs. If we have print out the AST, we do the above. Mostly individual ASTs are far too complicated to look at in practice, so we hardly ever print out the entire AST, at best only a node and a few children deep. Our tool doesn't exchange ASTs with anybody (see above reasons :) but does just fine building it in memory, doing whizzy things with it for analysis reasons or transformation reasons, and then either just deleteing it (no need to send it anywhere) or regenerating the original language text from the tree. [The latter means you need anti-parsing or "prettyprinting" technology]

Ira Baxter