From Source to AST to CodeDom

I am reading the book Language Implementation Patterns (http://tiny.cc/235xs) amongst a few others mixed in to clarify concepts as well as the occasional website. I am trying to make a tool that reads a trivial programming language and performs some basic analysis on it.

I am getting stuck in the design phase of this tool. I have constructed a simple handwritten recursive decent parser that validates a source file just fine. However, to perform source manipulations having a CodeDom tree would be useful.

The questions:

1) Are the logical steps a tool like this takes: Parse and build a textual tree and matching symbol table and then convert this to a CodeDom?

2) When building a textual tree, the most convenient would be a AST, easier to convert to a CodeDom .. but do Refactoring tools maintain a list of all embedded tokens in a statement in order to preserve inline comments and how do they track this in their tree?

You can build your own parser, your own tree builder, your own tree walker, your own analyzers, your own prettyprinters... but its a lot of work.

You might consider tools that provide all this machinery for you.

One such tool is our DMS Software Reengineering Toolkit.

Given a grammar, DMS will parse and automatically build a tree; yes, it automatically captures "microtokens" such as comments and attaches them to appropriate tree nodes. It can prettyprint the tree back out, before or after transformations. You have to provide support for symbol tables since that's a semantic, not a syntactic construct, but DMS provides generic symbol tables and scope management tools as a library to build upon. DMS also provides complete libraries for control and data flow analysis, which is needed if you want to do serious code transformations or refactorings.

One of DMS's nicest properties is the ability to apply transformations stated using the syntax of your grammar, e.g., "if you see this (in my language) then replace it by that".

You can see an example of defining lexer, parser, prettyprinter and transformation rules that define 9th grade algebra and a bit of calculus. The rewrite rules are used to carry out simplifications and computing symbolic derivatives on algebraic formulas.

Thank you for taking the effort to answer. I am aware of your product thanks to Google as well as a number of other answers to similar questions here on StackOverflow. But I am not looking for software to do it all for me; I am looking to learn how these tools work and accomplish their things.

Jaapjan 2010-06-03 10:25:18

@Jaapjan: Then what you need is a compiler book, such as Aho/Ullman Dragon book. This will explain in detail what you need to parse and build syntax trees. What it won't do is explain how to capture comments, how to implement rewrite rules, or how to build a prettyprinter. You can find technical papers on these topics if you search the computer science literature, but it is scattered everywhere. A key place to start is with conferences focused on this task, such as SCAM10, "Source Code Analysis and Manipulation". Its *still* a *lot* of work. DMS is over a man-century of PhD level work.

Ira Baxter 2010-06-03 10:36:32

True. But I can build an AST with a stack and node structure, adding each block and statement as node. But the question is actually if, usually, such an AST is a pure textual affair that captures the syntax structure or if this AST also contains the whitespace, comments and possibly internal references to other AST nodes, like superclass references. Ie: Are these different trees usually or do tools combine them?

Jaapjan 2010-06-03 11:05:12

AST nodes are typically nodes representing terminals in your grammer (or abstractions). Useful ASTs break down statements into trees including the operators and operands. You can think of the terminals in your grammars as nonterminals, if you consider the lexer to be a sub-parser; so all elements of your grammer have tree nodes, which produces a *concrete* syntax tree. You'll discover you can leave some out; that gives you an *abstract* syntax tree. Terminal nodes have values (identifiers, numbers) often contain a representation of the identifier/number.

Ira Baxter 2010-06-03 14:47:51

"Whitespace" is expensive and confusing to store literally. Nobody in practice stores the blanks that make up whitespace; instead they stamp tree nodes with column information. Comments are harder; they're easy to lex, but it is difficult to decide what to do with them; is this comment related to this token, the previous token, this block, the previous block...? Each system when can store these answers this question differently. DMS provides mechanism for comment capture; the langauge designer provides policy about where to capture them.

Ira Baxter 2010-06-03 14:50:25

Thank you! These answers are useful. I am using an homogeneous tree node structure to build an abstract syntax tree from the parsed source. You are saying most tools implicitely store whitespace by storing the columns and lines of actual tokens and use that during refactoring. Mm. And you could, I suppose, link embedded comments to the nearest node. As to my other question, do typical tools generate a CodeDom (with hard links to references such as variables and base classes) or build an exectutable simply using the AST and the symbol table?

Jaapjan 2010-06-04 07:31:01

The term "CodeDom" seems to be a Microsoft-ism for "bad syntax tree", that is, ones that don't provide all the structure for the complete language (e.g., statements are CodeDom nodes that contain text of the statement). I prefer to think about complete (abstract/concrete) syntax trees. "Hard references to varaibles"? Identifier nodes are just tree nodes. Whether they are linked (by some additional mechanism) to symbol table data stating the type of the ocrresponding language-scoped identifier isn't usually what a bare parser does, but that's certainly useful. "Base classes"?

Ira Baxter 2010-06-04 11:52:47

In this particular case I meant a tree that does not use text nodes expect for identifiers but where all references are hard-linked to their own node in the tree. Such as super/base classes in a class-node, a variable reference in an expression that points to the definition, such a thing. This way, you do not need a symbol table. If you keep it as text fields, you need a symbol table alongside your AST to track what symbol you're talking about in later stages. Right?

Jaapjan 2010-06-08 09:17:09

Just hard linking identifiers back to "the" declaration in the tree is a nice touch, but it won't get you away from symbol tables. The problem is that to work with the *type* information at the declaration, with just your hard link, you have to re-interpret the AST at the declration point each type you are interested in the type. A symbol table caches a pre-digested version of the type information that is designed to make access to the type information fast. You may also find that the type information for a symbol comes from several places; what do you link to in that case?

Ira Baxter 2010-06-08 09:36:58

So there's a theme here. If you don't care about speed, frankly you can work with the raw text. ASTs are a cache of the analysis of the structure of the code according the grammar rules. Symbol tables are a cache of the analysis required to figure out the type information. Control flow graphs are a cache of the analysis of, well, control flow, and similary for data flow, call graphs, etc. The point is that ASTs aren't everything, you'll need to do additional analysis, you probably need it more than once, so you'll end up caching it.

Ira Baxter 2010-06-08 09:39:01

A very good point-- it might indeed come from multiple points now that you mention it. And not everything can be determined immediately.

Jaapjan 2010-06-08 09:51:52

If you like the back-linking trick, you should go read the papers on Intentional Programming http://en.wikipedia.org/wiki/Intentional_programming, see especially the discussion about "identity". They get a lot of mileage out of it. FWIW, DMS doesn't keep an explicit backlink; rather, it keeps an auxiliary hashmap to node-to-symbol table links.

Ira Baxter 2010-06-08 09:56:54

What I had done to date was constructing an AST with "hard links" not set yet and build a scoped symbol table alongside a list of "to-be-fixed" references. At the end, walk the reference table and fix up the hard links in the AST... And how do you reply so fast/notice updates here?

Jaapjan 2010-06-08 10:19:49

One theory is that I'm just an advanced AI application :-} A better theory is I look at SO each day, and SO tells me (see the little envelope at the top of the page) when there is a comment to an answer. Your responses are pretty fast, too.

Ira Baxter 2010-06-08 15:12:01

You might have been using one of the API applications or tools they have available, sort of, by now. Back to my question, you would advice using a textual AST without backlinks and a matching symboltable for code/language file generation then?

Jaapjan 2010-06-08 15:34:44

You need the AST. You need the symbol table. If you want to do really spectacular code generation you need control and data flow analysis, etc. You don't need the backlinks, and I don't think you'll find them very useful if you have the symbol table.

Ira Baxter 2010-06-08 17:26:24

ansaurus

tags:

views:

answers:

From Source to AST to CodeDom

related questions