views:

70

answers:

4

I am learning on my own about writing an interpreter for a programming language, and I have read about Abstract Syntax Trees. I have an idea of what they are, but I do not see their use.

Why are ASTs useful?

+2  A: 

They represent the logic/syntax of the code, which is naturally a tree rather than a list of lines, without getting bogged down in concrete syntax issues such as where you place your asterisk.

The logic can then be manipulated in a manner more consistent and convenient from the backend's POV, which can be (and is, for everything but Lisps ;) very different from how we write the concrete syntax.

Roger Pate
So why would I use an AST instead of just making a stream of stuff like "IDENTIFIER blah ASTERISK NUMBER 4" and having an FSM or something go down and eat one token at a time?
RacecaR
@Rac: What you describe is a step between the raw file and generating an AST. That is: raw -> tokens -> AST. BinaryOp(Multiply, Identifier("blah"), Integer(4)) is more convenient for the backend.
Roger Pate
@RacecaR your FSM for parsing would eat the stream to produce an AST; your code-generation and -optimization routines would use the AST
Steven A. Lowe
Is there a specific name for that? Also I can think of something where the interpreter goes along and sees an if statement, and the bytecode or whatever you have says that if something is true, the interpreter should go forwards 8 tokens, if not, go forward 0 tokens (or something). How would an AST work for something like that? Maybe give the next path (next node) the interpreter should go down?
RacecaR
[Lexical analysis](http://en.wikipedia.org/wiki/Lexical_analysis). The AST would have one node for the condition which has two subnodes: true and false. An AST is very different from bytecode/asm (they're the same thing, just targetting different machines) where you have "jump forward X instructions" (not tokens).
Roger Pate
So is that's the difference between Virtual Machines and ... whatever this would be called?
RacecaR
Virtual machines are implemented "on top of" another machine. However, even x86 is a [virtual machine](http://www.infoq.com/presentations/click-crash-course-modern-hardware), but, generally, "VM" is used to describe something implemented purely in software.
Roger Pate
Ah I see. But, a normal programming language generates an AST and runs down it, but a Virtual Machine based language has bytecode and stuff which it gobbles up with an FSM or something right? Sorry I can't understand this but I want to make sure I get all the info I can.
RacecaR
How the AST is used depends on what your purpose is. CPython (the main Python implementation), for example, generates bytecode from an AST which its interpreter can use. However, it's possible to "execute" directly from an AST too, and many "smaller" interpreters do this (/bin/sh comes to mind, but there are many implementations of it).
Roger Pate
Ok, so bytecode is for VMs. Thanks very much Roger, this has helped me a lot.
RacecaR
A: 

In general you are going to parse you code into some form of AST, it may be more or less of a formal model. So I think what Kirk Woll was getting at by his comment above is that when you parse the language, you very often use the parser to create some sort of data model of the raw content of what you are reading, generally organized in a tree fashion. So by that definition an AST is hard to avoid unless you are doing a very simple translator.

I use ANTLR often for parsing complex languages and in that context there is a slightly more specific meaning of an AST. ANTLR has a handy way of generating an AST in the parser grammar using pretty simple actions. You then write a generally much simpler parser for this AST which you can operate on like a much simpler version the language you are processing. Whether the extra work of building two parsers is a net gain is a function of the language complexity and what you are planning on doing with with it once you parsed it.

A good book on the subject that you may want to take a look at is "Language Implementation Patterns" by Terrence Parr the ANTLR author. He addresses this topic pretty thoroughly. That said, I didn't really get ASTs until I started using them, so that (as usual) is the best way to understand them.

JeffW
A: 

The main benefit os using an AST is that you separate the parsing and validation logic from the implementation piece. Interpreters implemented as ASTs really are easier to understand and maintain. If you have a problem parsing some strange syntax you look at the AST parser , if a pices of code is not producing the expected results than you look at the code that interprets the AST.

The other great advantage is when you syntax requires "lookahead" e.g. if your syntax allows a subroutine to be used before it is defined it is trivial to validate the existence of a subroutine when you are using an AST - its much more difficult with an "on the fly" parser.

James Anderson
A: 

You need "syntax trees" to represent the structure of most programming langauges, in order to carry out analysis or transformation on documents that contain programming language text. (You can see some fancy examples of this via my bio).

Whether that tree is abstract (AST) or concrete (CST) is a matter of taste, convenience, and engineering sweat. The term CST is specially used to describe the parse derivation tree when a grammar is used to deconstruct source code; it usually contains tree elements for lots of concrete syntax such as statement terminator semicolons. AST is used to mean "something simpler than the CST", e.g., leaving out semicolon tree nodes because they don't affect program analysis much, and thus writing analyzers that process ASTs is less conceptual and engineering effort than writing the same analyzer on a CST. A better way to understand this is to realize that the AST is usually as isomorphic equivalent of the CST, that is, you should be able to regenerate the CST from it. If you want to transform the source text and regenerate it, then the CST is often a better choice as it loses less information from the original program (and my fancy example uses this approach).

I think you will find the SO discussion on abstract vs. concrete syntax trees pretty helpful.

Ira Baxter