views:

81

answers:

1

I'm interested in natural language processing and wondering what approaches have been used to search for programming constructs and idioms in code. Assuming we can generate an abstract syntax tree for the program it should be possible to identify common constructs and perhaps deduce something about the motivation of the program(mmer). For example if I write:

x = a + r * Math.cos(theta);

and you write

var1 = var2 + var3 * Math.cos(var4);

these have the same abstract tree (with different annotations).

If I have a program which contains the statements in close proximity:

var1 = var2 + var3 * Math.cos(var4);
var5 = var6 + var3 * Math.sin(var4);

then I might be able to deduce that the programmer was converting from polar to Cartesian coordinates and could therefore search for it.

Obviously there is a problem with normalization as someone else could write

var1 = Math.cos(var4) * var3 + var2;

but maybe there are ways to canonicalize such statements.

If this is a well-advanced field, have there been studies of the most frequent constructs that are actually used?

+1  A: 

The DMS Software Reengineering Toolkit provides AST generation for many languages, as well as surface-syntax patterns used for matching against ASTs, generating ASTs, and transforming code. Here you can learn something about how DMS works with surface-syntax patterns and why these are hugely better than matching trees-as-data-structures.

A DMS pattern matching your statement for JavaScript would be x = a + r * Math.cos(theta); would look like:

  domain JavaScript.

  pattern my_pattern261(x: left_hand_side, a: product, r: term,
                        fn: identifier, angle: identifier):assignment
  =  " \x = \a + \r * Math.\fn ( \angle ) ; ".

The " " characters are meta syntax saying "what's inside here is JavaSrcript syntax"; what's outside the quotes is pattern-language syntax. The "\" character is a pattern-matching metacharacter indicating that the following thing is a meta variable (there are other more sophisticated meta-objects but they weren't needed for this pattern). All the metavariables here are given syntactic type categories for the language of interest. For the pattern variables listed, a matched pattern binds the pattern variables to the corresponding ASTs.

This same pattern would match your other statement.

While such patterns are useful for single statements or other language nonterminals, they don't capture the notion of control and data flow between multiple patterns well. (Here the Cocinelle solution is sort of interesting, with "A ... B" meaning A precedes B in the control flow). DMS doesn't solve this problem directly, but it does have very strong control and data flow analysis for C, COBOL, and Java and one can check the patterns separately and then verify the various control and dataflows are present between bound instances of pattern variables.

Ideally one wants a real control/dataflow relation to be recorded in the pattern itself. This idea was attempted back in the 1980s with a system called the The Programmer's Assistant (PA), which was intended to match a pre-existing libray of code idioms against code that a programmer had to help reason about it. PA never really went very far because the parsing technology was weaker and the machines a lot smaller. But from my point of view, it had all the right ideas. I continue to look to it for inspiration. We actually proposed syntax for a PA like extension of DMS for a Navy SBIR about 10 years ago, but it didn't get funded for further pursuit.

While this is useful when looking for specific patterns, often you don't know what patterns to look for when locating such common concepts. A tool that could find these is probabaly of more interest to you.

An interesting way to find these, is to locate code clones, blocks of code that that have been copy/paste/edited. What you count on here is for the programmer's to identify interesting bits of code by virtue of stealing it; multiple copies lying around in your source code base say it is interesting. We build a tool called the Clone Doctor (CloneDR) which finds such duplication. What it detects are things that look like the patterns above over possibly multiple statements. You can see examples of detected clones here, that look like much bigger parameterized versions of your examples. It even shows you what the pattern that unifies them looks like.

The CloneDR works by matching ASTs, and it scales to millions of lines of code. A more sophisticated approach matches data flow computations (of the same type proposed by the Programmer's Assistant) to find such clones. By switching to dataflow and gettting rid of syntax, some of the normalization you hint at is automatically accomplished. The bad news is that it scales rather poorly and can't be used to investigate big software systems. There is research ongoing trying to scale the dataflow clone matching approach. There is work in so-called algorithm recognition that attempts to use data flow and algebraic laws to reformulate expressions to some kind of normal form before matching occurs. So far it hasn't been very successful as far as I know.

A lesson from running CloneDR (and from other clone detection tools) is that there aren't a lot of "frequent constructs" used by every programmer, unless you count the language constructs themselves but we already knew that :-}. But there's 10%-30% cloned code in any source code base.

And any team working on a source code base will find, IMHO, considerable value in looking at the detected clones and inducing the abstractions. If you work on a code base, the clones found are often very easy to identify into terms of abstract function. (If you don't know the code base or the application, as usual, you are clueless).

Ira Baxter
+1 This is an excellent response and thought-provoking.
peter.murray.rust