What are common traits/properties of programming languages that facilitate (simplify) the development of largely automated source code analysis and re-engineering (transformation) tools?

I am mostly thinking in terms of programming language features that make it easier to develop static analysis and refactoring tools (e.g. compare Java with C++; the former has much better support for refactoring).

In other words: if a programming language were explicitly designed from the beginning to support automated static analysis and refactoring, what characteristics would it preferably feature?

For example, for Ada, there's the ASIS:

The Ada Semantic Interface Specification (ASIS) is a layered, open architecture providing vendor-independent access to the Ada Library Environment. It allows for the static analysis of Ada programs and libraries. ASIS, the Ada Semantic Interface Specification, is a library that gives applications access to the complete syntactic and semantic structure of an Ada compilation unit. This library is typically used by tools that need to perform some sort of static analysis on an Ada program.

ASIS information: ASIS provides a standard way for tools to extract data that are best collected by an Ada compiler or other source code analyzer. Tools which use ASIS are themselves written in Ada, and can be very easily ported between Ada compilers which support ASIS. Using ASIS, developers can produce powerful code analysis tools with a high degree of portability. They can also save the considerable expense of implementing the algorithms that extract semantic information from the source program. For example, ASIS tools already exist that generate source-code metrics, check a program's conformance to coding styles or restrictions, make cross-references, and globally analyze programs for validation and verification.

Also see the ASIS FAQ.

Can you think of other programming languages that provide a similarly comprehensive and complete interface for working with source code, specifically for analysis/transformation purposes?

I am thinking about specific implementation techniques that provide the low-level hooks, for example core library functions that offer a way to inspect an AST or ASG at runtime.
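
To illustrate the kind of hook I mean (just a sketch, not a requirement; the function and names below are made up), Python's standard library ast module (Python 3.9+ for unparse/indent) already lets a tool parse, inspect and rewrite the language's own syntax trees:

    import ast

    # Parse source text into the language's own AST and inspect its full structure.
    source = "def area(r):\n    return 3.14159 * r * r\n"
    tree = ast.parse(source)
    print(ast.dump(tree, indent=2))

    # A tiny source-to-source transformation: rename the parameter 'r' to 'radius'.
    class RenameParam(ast.NodeTransformer):
        def visit_arg(self, node):
            if node.arg == "r":
                node.arg = "radius"
            return node

        def visit_Name(self, node):
            if node.id == "r":
                node.id = "radius"
            return node

    print(ast.unparse(RenameParam().visit(tree)))  # regenerated, transformed source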

+2  A: 

Reflection built into the language/type system. This makes static analysis and refactoring much less painful.

This is part of why Java and .NET tools are so commonplace and nice. Reflection gives the tools a much better ability to understand the dependencies of source code quickly and reliably, which helps with static analysis.

In addition, you get the ability to do analysis of your compiled code, as well.
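
For example (a minimal Python sketch of the same idea; the class and member names are purely illustrative), reflection lets a tool enumerate a type's members and their declared signatures without re-parsing any source:

    import inspect

    class Account:  # illustrative class, not from any real codebase
        def __init__(self, owner: str, balance: float = 0.0):
            self.owner = owner
            self.balance = balance

        def deposit(self, amount: float) -> None:
            self.balance += amount

    # A tool can discover methods and their signatures via reflection alone.
    for name, method in inspect.getmembers(Account, predicate=inspect.isfunction):
        print(name, inspect.signature(method))
    # __init__ (self, owner: str, balance: float = 0.0)
    # deposit (self, amount: float) -> None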

Reed Copsey
Thanks for your response - yes, reflection is of course an obvious candidate, but there are programming languages that largely do without it; Ada, for example, doesn't really have much support for reflection at all, yet it has a really good interface for source code analysis and transformation.
none
Yeah - but you mentioned ASIS in your question itself. Reflection, as a general-purpose concept, provides many of the same advantages as ASIS in a different form. You can do source analysis with nothing but source code (and no tools), but then you're basically writing your own compiler to do it.
Reed Copsey
Maybe I'm misunderstanding the question or this answer, but I think refactoring is made more difficult by reflection. The information provided in order to do reflection is very helpful, but actual reflection calls inside the code tend to blow most tools completely out of the water. To find instances of a class, you now have to scan textual strings (flaky) or switch from static analysis to runtime analysis, which is also flaky since it depends on a specific piece of code being hit.
Bill K
Reflection as implemented in most languages is pretty lousy for serious code analysis. Usually all you get is the ability to inquire about lists of symbols (e.g., methods) related to another symbol (e.g., a class). To do serious analysis, you want to be able to "reflect" (really, inquire) about the finest structure of any detail of the program. To do that, you need access to the source text directly, or a more easily manipulated equivalent, e.g. an abstract syntax tree with symbol tables and control and data flow information. That in turn requires serious analysis infrastructure.
Ira Baxter
Continuing... I agree with the earlier comment that reflection makes a language that much harder to reason about, because if a program is doing analysis and wants to reason about itself, it now also has to reason about how it reasons about itself. Shades of the halting problem all by itself. Better IMHO to have the analysis machinery outside the program, so that the analyzer doesn't have to reason about the analyzer too.
Ira Baxter
I find these arguments fairly unreasonable. Reflection gives the analysis one more tool to work with; it's not taking anything away. A tool is free to do its own full "compilation" to construct a full syntax tree and symbol table, but reflection makes that easier, since the tool no longer needs to understand and compile dependency libraries to work out the interrelationships with them. I do agree with Bill K's comment that it can make the source code more difficult to analyze, but that depends on how the code is written (i.e. using typeof(..) vs. GetType("string"), etc.).
Reed Copsey
+5  A: 

The biggest has to be static typing. This allows tools to have much more insight into what the code is doing. Without it refactoring becomes many times more difficult.
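
A minimal, hypothetical sketch of the difference (all names are made up; Python is used only because it can show the annotated and unannotated cases side by side): with a declared type a tool can bind a call to exactly one definition and rename it safely; without one, the receiver could be anything:

    class LogFile:  # illustrative type
        def flush(self) -> None:
            print("flushed")

    def close_all(files: list[LogFile]) -> None:
        for f in files:
            f.flush()  # statically known to be LogFile.flush: safe to rename

    def close_all_untyped(files):
        for f in files:
            f.flush()  # could be any object with a 'flush' attribute: renaming is risky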

Bill K
I actually agree; I think this is one of the key reasons why Ada's support is so extensive: Ada's strong support for typing. See http://en.wikibooks.org/wiki/Ada_Programming/Type_System
none
Refactoring in a dynamic language can be a big challenge, because you never really know what a variable is going to be at runtime (scalar, vector, hash, function object).
none
If you can do flow analysis, you can often determine what the actual type of a dynamic variable is; most good coders don't put 23 different kinds of things into the same variable. So the real need for supporting refactoring in dynamic languages is really good flow analysis.
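
A toy illustration of the idea (nothing like production flow analysis; it only records direct assignments of literals in a made-up snippet, using Python's ast module):

    import ast

    # Record which literal types each variable is assigned anywhere in the code.
    # Real flow analysis would follow paths, calls and aliasing as well.
    source = 'x = 3\nx = x + 4\ny = "hello"\n'

    assigned_types = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    kinds = assigned_types.setdefault(target.id, set())
                    kinds.add(type(node.value.value).__name__)

    print(assigned_types)  # {'x': {'int'}, 'y': {'str'}}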
Ira Baxter
@Ira That's very interesting. I've been struggling with refactoring in Ruby because of its dynamic typing. People say that IDEs don't need any more features, but I'd love to see someone build one with this flow analysis stuff you're talking about.
LoveMeSomeCode
@LoveMeSomeCode: see my answer to this question. The DMS tool has very strong flow analysis machinery for C, Java, COBOL. Not yet configured for Ruby :-}
Ira Baxter
+1  A: 

It is true that the particular programming language can make analysis easier. If you want the easiest-to-analyze languages, pick a purely functional one.

But nobody in practice programs in purely functional languages. (The Haskell guys are going to jump up and down when they see this, but seriously, Haskell is used only extremely rarely.)

What makes a programming language analyzable is infrastructure designed to support analysis. Ada's ASIS, above, is a great example. Don't be confused by the fact that ASIS was written for Ada, or is written in Ada; what counts is that somebody serious wanted to analyze Ada and invested the effort to build Ada analysis machinery.

I believe that the right cure is to build general analysis infrastructure and amortize it across lots of languages. While we're at it, we should build general transformation infrastructure too, because once you have an analysis, you'll want to use it to effect change. (Doctor visits don't end with a diagnosis; they end with a cure.) And I've bet my career on it.

The result is an engine I think is ideal for analysis, refactoring, reengineering, etc.: the DMS Software Engineering Toolkit. See http://www.semdesigns.com/Products/DMS/DMSToolkit.html

It has generic parsing, tree building, prettyprinting, tree manipulation, source-to-source rewriting, attribute grammar evaluation, and control and data flow analysis. It has production quality front ends for a number of widely used dialects of C and C++, for Java, C#, COBOL, and PHP, and even for Verilog and VHDL (many other languages too, but not quite at that level).

To give you some sense of its utility, it was used to convert JOVIAL code for the B-2 bomber into C... without us ever having seen the source code. See http://www.semdesigns.com/Products/Services/NorthropGrummanB2.html

Now, assuming one has analysis infrastructure, what language features help?

Static types help by limiting the set of possible values a variable can take, but only by adding a limited single-argument predicate, e.g., "X is an integer". I think what helps more are assertions in the code, because they capture predicates over more than one argument that often cannot be found by inspecting the code (e.g., problem- or domain-specific information such as "X > Y+3"). The analysis infrastructure (and frankly, the programmers that read the code) can ideally take advantage of such additional facts to provide a more effective analysis.

Such assertions are commonly coded with special keywords such as "assert", "precondition" and "postcondition", inspired with good reason by the theorem-proving literature.

But even if you don't have assertions in your language, they are easy to encode anyway: just write an if statement whose condition is the denial of the assertion and whose body does something that signals impossibility or violates the language semantics (e.g., dereferences an obviously null pointer), for example "if (x > 0) fail();" to assert that x <= 0.
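
A minimal sketch of that encoding (the overdraft invariant here is just an illustrative domain fact that no analyzer could guess on its own):

    def unreachable() -> None:
        # An "impossibility" idiom that both readers and analysis tools can recognize.
        raise AssertionError("domain invariant violated")

    def withdraw(balance: int, amount: int) -> int:
        new_balance = balance - amount
        # Encodes the assertion "new_balance >= 0" without any assert keyword:
        # the if-condition is the assertion's denial, the body signals impossibility.
        if new_balance < 0:
            unreachable()
        return new_balance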

So what's really needed isn't assertions in the language, but programmers who are willing to write them. Alas, that seems to be sadly lacking.

Ira Baxter
Thanks for your comments. I am aware that technologies like ASIS are not necessarily language specific, but the corresponding infrastructure is - like you say - unfortunately not easily available for other languages. In C or C++, one part of the complexity is caused by the languages themselves (IIRC you are using the EDG frontend?). And while there was even some talk on the gcc mailing list about providing such an infrastructure some time ago, http://gcc.gnu.org/ml/gcc/2007-11/msg00522.html, this didn't really materialize much.
none
We are not using the EDG front end. EDG wants "to be a compiler", which means it is terrible for reengineering purposes. Our tools are designed to capture all kinds of information about the source code, partly so it can be regenerated later. This includes activities like retaining comments, retaining the radix of integers, NOT expanding preprocessor directives where possible, parsing thousands of compilation units in a single run (compilers do one at a time), etc. We build our language front ends using our own parsing machinery to make sure we can capture all this data correctly.
Ira Baxter
I think the combination of a) functional programming (style), b) static/strong typing, c) exposing language internals (e.g. hooks into the parser) and d) providing analysis and transformation infrastructure on top of it, is pretty much the answer to the original question.
none
Hooks into *a* parser, not necessarily the one for the language's compiler; that one is completely biased against helping you. It's just easier to have the parser and analysis/transformation machinery packaged separately from the "language" and from its compiler. *But the real win* is assertions: no amount of program analysis machinery can guess properties of the problem domain (in banks, you *have* to conserve money!), so these simply have to be told to whatever tools you have. My point is that they are easy to code already, but our culture doesn't encourage it.
Ira Baxter
A: 

For refactoring: self-similarity

The ability to accept code transplants without intrusive alteration or bizarre reinterpretation. Examples:

  • Extract a snippet of C++ to a new procedure, by using reference parameters to give it modifying access to variables.
  • Python, JavaScript and Lua methods really are just functions that have a 'self' parameter. *
  • In any number of languages, a function that creates/populates a struct can be (more or less trivially) converted to a constructor (sketched just below).
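
A rough sketch of that last point (illustrative names, Python for brevity): the populate-a-record code moves more or less verbatim into a constructor:

    # Before: a function that creates and populates a plain record.
    def make_point(x, y):
        p = {"x": x, "y": y}
        return p

    # After: the same population logic moved, more or less verbatim, into a constructor.
    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y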

Counterexamples...

  • Ruby (modules, classes, methods, lambda blocks and raw blocks): the differences in semantics are bewildering, to say the least (which is all I feel qualified to say for sure).

For the (to my mind) wildly different case of automatic mangling I'm a lot less sure, but the freedom from side-effects offered by functional programming languages is really it. (Ok, so how could we offer the same thing in a language for the rest of us?)

* Python is almost like that. (I forgot what the gotcha is. Probably something to do with whether the method was defined in the class or grafted on at runtime.)

Anders Eurenius
A: 

I think this is still a largely unexplored problem. The notion of "language design for tooling" seems to only have entered the fringes of the mainstream recently, though I think research in this area is more than two decades old. I agree with two of the other answers, namely that "static typing" and "self-similarity" are useful properties of a language to make refactoring support easier.

Brian
+1  A: 

There is a language built around the "code is data" paradigm, i.e. every line of code is just data in terms of the language itself. This makes refactoring as basic an action as primitive data operations. And the name of this language is Lisp. ;)
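
Not Lisp itself, but a hedged illustration of the idea in Python (the expression and the rewrite rule are made up): once code is just nested lists, a "refactoring" is an ordinary data transformation:

    # "Code is data": an expression represented as nested lists, Lisp-style.
    expr = ["*", ["+", "x", 0], "y"]  # (* (+ x 0) y)

    def simplify(e):
        # Rewrite (+ e 0) -> e everywhere: refactoring expressed as plain list surgery.
        if not isinstance(e, list):
            return e
        op, *args = [simplify(part) for part in e]
        if op == "+" and len(args) == 2 and args[1] == 0:
            return args[0]
        return [op, *args]

    print(simplify(expr))  # ['*', 'x', 'y']   i.e. (* x y)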

Seriously speaking, "a language for programming" and "a language for the machine" are two different requirements, and a perfect language for analysis could be a nightmare for the programmer. Even more, a language designed for some kind of analysis might not be a programming language at all. (Last week I came across a language for pointer analysis that has no textual representation and only two executable statements.)

And again: first you have to define the task and then solve it. For example, if the task is "I want to write safe programs, e.g. I want to be sure that I will never try to mix integral and character operands", then you need a language with static types. OK, "I need to know at runtime what I can do with external libraries" - reflection is your choice. "I need a universal programming language for interchange, transformation and analysis" - most likely, this is not what you really want.

vpolozov
A: 

IMO the most important property is that the language is completely specified and deterministic. For example, in C the behaviour of the following code is not defined by the language specification:

x = x++ + ++x;

If the code's behaviour is undefined, yet it compiles and does something, there is no safe way to automatically change it (i.e. refactor it) in a way that preserves that something.

The next important property is that the language doesn't allow access to variables (fields) beyond their scope. Pointers make it possible, e.g. in C, to access any variable's value simply by "guessing" its address. In a language like that, there are cases where it is not possible to tell where in the code a certain variable's value is read and/or changed. Again, there is no safe way to automatically refactor a program that might do something like that.

ammoQ