tags:

views:

1376

answers:

7

I have an AST derived from the ANTLR Parser Generator for Java. What I want to do is somehow construct a control flow graph of the source code, where each statement or expression is a unique Node. I understand there must be some recursiveness to this identification, I was wondering what you would suggest as the best option and if ANTLR has a toolset I can use for this job. Cheers, Chris


EDIT - My main concern is to get a control flow graph(CFG) from the AST. This way I can get a tree representation of the source. To clarify, both the source code and the implementation language is Java.

A: 

Did you ever tryed ANTLR Studio? It does not generate the hole AST graph, but for review, its already quite helpful.

gizmo
ANTLR Studio is basically a language editor for the ANTLR's automatically generated parsers. I have the parsers and lexers. What I need is a way to manipulate the AST.Any thoughts?
A: 

When I have done this in the past, I used graphviz, in particular the dot tool, to generate the graph. I created the dot input file by actually traversing the control-flow graph at compile time.

Graph layout is a hard problem, and graphviz does an excellent job. It can output to ps, pdf, and various image formats, and the layout is usually pretty intuitive to look at. I highly recommend it.

Denis Bueno
I would be more interested as to how you traversed the control flow graph at compile time, rather than the actual visualization of the graph once it's constructed.Cheers
Usually at this point you have generated fairly low-level code that consist of non-jumping instructions and jumping instructions. The former correspond to CFG nodes, and the latter contains implicit edges (the places to jump to). See alsohttp://en.wikipedia.org/wiki/Control_flow_graph.
Denis Bueno
You may want to read up on "code generation": http://en.wikipedia.org/wiki/Code_generation_(compiler) -- this is the process of going from your AST to some lower-level representation, and this usually precedes the construction of the CFG.
Denis Bueno
A: 

Based on some comments, it sounds like the OP really wants to do code generation -- to convert the AST into a lower-level sequence of instructions based on basic blocks and jump points.

Code generation is very language-specific, and a lot of work has been put into this topic. Before you do code generation you need to know your target language -- whether it be assembler or simply some other high-level language. Once you have identified this, you simply need to walk the AST and generate a sequence of instructions that implements the code in the AST. (I say this is simple, but it can be hard -- it's hard to generalise because the considerations here are pretty language-specific.)

The representation you choose for code generation will contain the control-flow graph, implicitly or explicitly. If your target language is fairly low-level (close to assembler), then the control-flow graph should be relatively easy to extract.

(Please comment if you'd like more clarification.)

Denis Bueno
I agree that knowledge of the target language (Java) is imperative. I'm looking for some insight as to how to approach the AST walk into a form that implicitly holds the control flow graph. Any suggestions?
If you know how to generate the Java, then to make a CFG from the java: make a node for each statement that is not a method call into your program. For method calls, draw an edge to the entry of the body for that method.
Denis Bueno
In general this is a hard task, even if I knew your source language, which I don't. You just have to ... come up with a mapping of your source language constructs into Java.
Denis Bueno
+4  A: 

Usually CFGs are computed on a lower-level representation (e.g. JVM bytecode). Someone did a thesis on such things a few years ago. There might be a helpful way described in there for how to get at that representation.

Since your source and target languages are the same, there's no code generation step -- you're already done! However, now you get to walk the AST. At each node of the AST, you have to ask yourself: is this a "jumping" instruction or not? Method calls are jumping instructions. Other instructions are non-jumping.

First associate with each java statement a node in the CFG, along with an entry and exit node. As a first approximation, walk the tree and:

  1. if the current statement is a method call, figure out where the entry node is for the corresponding body of that method call, and make an edge pointing from the current statement to that entry node.
  2. for each non-jumping statement, make an edge between it and the next statement.

This will give you some kind of CFG. The procedure is slightly hairy in step 2 because the method called may be declared in a library, and not elsewhere in the AST -- if so, either don't make an edge or make an edge to a special node representing the entry to that library method.

Does this make sense?

Denis Bueno
+1  A: 

ANTLRworks can show you a syntax diagram, which may be just what you need. Also see this image

Jorn
A: 

This is a really quick way to getting started looking at your parse tree/graph in ANTLR. http://jumbleagilemanuals.blogspot.com/2008/03/10-steps-to-beginning-to-parsing-with.html

It's quite detailed.

hawkeye
+2  A: 

Producing a full control flow graph that really takes into account all the language issues is harder than it looks. Not only do you have to identify what appears to be the "basic blocks", but you have to identify function calls (sort of easy, but identifying the target might be harder), where behind-the-scenes operations such as class initializers can happen. and to worry about the points where exceptions can occur and where control goes if an exception does occur.

If you examine most languages carefully, they will also be clear about ordering of evaluation of computations in expressions, and this matters if you have two side-effects in an expression; the control flow should reflect the order (or the non-order, if it isn't defined).

Maybe you only want an abstraction of the control flow having the basic blocks and the conditionals. That's obviously a bit easier.

In either case (simple CFG or full CFG), you need to walk the AST, at each point having a reference to possible control flow targets (e.g., for most cases, such as IF statements, there are two flow targets: the THEN and ELSE clauses). At each node, link that node to the appropriate control flow target, possibly replacing the flow targets (e.g., when you encounter an IF).

To do this for the full language semantics of Java (or C) is quite a lot of work. You may want to simply use a tool that computes this off-the-shelf. See http://www.semanticdesigns.com/Products/DMS/FlowAnalysis.html for what this really looks like.

Ira Baxter