views:

235

answers:

9

I wonder if there is a code comparison tool that take into account the call hierarchy of your code. Something like this: you provide the two projects to compare and which method and the file containing it to use as the head of the hierarchy. The tool would extract all methods potentially called by this method (polymorphism and dynamic calls could make this part tricky...) and ultimately lists all the methods that have been modified between both versions.

Is this asking for too much? It feels like versioning systems are stuck in stone age with our file/class based versioning systems.

Edit: By call stack, I don't mean "runtime call stack". In other words, I want this to be a static analysis of the code.

A: 

What you're asking is a whole lot easier with the compiled virtual machine code than with source. Otherwise your analysis tool has to act like a compiler.

It wouldn't be difficult at all to find the direct calls, but as you say things like interfaces get very messy. dynamic would be much harder, and if reflection is involved it would be much harder still. I think it would be possible to identify potential calls from interfaces and dynamic instances, but identifying the likely calls from the potentials would be very difficult to do, and it would be impossible to identify them all.

For example, imagine a method that returns the value of an arbitrary property from an arbitrary object, using reflection (or, I suppose you could do this with dynamic).

public object GetAProperty(object obj, string propname)
{
    // do your magic here
}

The program would have to identify that GetAProperty can potentially access any object. If it was really smart, it could identify all the call sites and possibly determine which classes are passed as parameters to the method. But those call sites could be working with object, as well, so you'd have to go back another step, ad infinitum.

Identifying the direct calls looks like a straightforward problem. Even handling simple inheritance, though, becomes very difficult.

Perhaps somebody more familiar with compilers for modern languages can give more concrete info.

Jim Mischel
the example you provide (GetAProperty(object... )) is irrelevant since you have to cast to an interface or a class before making the call. You then have the information about the call. The problem with polymorphism is that you end up with more possibilities than you wish for but those "false call" will be filtered out if they have not been change. Unless your program changed heavily, it's not much of an issue. Only the `dynamic` is a problem but we don't have those in our code yet (and not a problem with java).
Simon T.
A: 

I don't know if that is at all possible/feasible to an extend where it is safe enough to use as a diff-mechanism.

Optimization-, trace-/debug- or other settings, either directly enabled/disabled between runs or kicked in by "self-tuning" or warmup-effects after a couple of invocations, might yield - to the outside observer - different callstacks for unchanged binaries.

So even though code is actually being called, due to inlining / tail-calls, etc. you might never be able to observe the function "frame" and the callstack.

Then there are things like branching, i.e. different paths taken, and thus different functions being called depending on the state of the program. What, for example, happens when you have code like this:

if (DateTime.Now.Second % 10 == 0)
   FuncA();
else
   FuncB();

Granted, that is a rather contrived example, but you get the idea. Branching can happen depending on values that are not in the program's power to change/prearrange.

Finally, there is another problem with this approach: just because callstacks are equal, doesn't mean that the code performs the same actions or logic. For example, it would total ignores the fact that the code of the functions could have totally changed, while "preserving" the observed callstack.

Christian.K
I edited my post. I used the "callstack" for lack of a better word but I'm looking for a static analysis of the code. I don't care about the call logic. Just show me what changed in the hierarchy and that will save me so much time finding the bug.
Simon T.
+1  A: 

Is this asking for too much? It feels like versioning systems are stuck in stone age with our file/class based versioning systems.

I feel exactly the same way. While I have not found any tool to do this it would actually a very good project to start. First of all text/line based diff-merge it's actually usefull for text files. But programming languages code files are more than that. They are actually tree based documents, where the tree is the one generated by the parser (component from the compiler) which happens to be (sadly) encoded in a text file.

If the code file would actually be edited be a code-editor (I'm not talking about graphical code generators, but text but constrained to legal language constructs) where there is no way to have a syntax error because the editor just don't let you type something not making sense to the compiler (notice that it could still make no sense at runtime).

The code file would not be editable (or at least not recommended) in plain text editor but only from IDEs (I know some people will complain at this but a plain text editor is also an specialized editor for text, that knows about encodings ASCII... Unicode, UTF8, etc; instead of using an hex editor to write bytes to files.) then it is posiible to mantain a tree based document like for example other kind of documents like word, excel autocad, photoshop, etc filetypes.

Sadly I believe this is not gonna happen soon anyway at mainstream languages because of the same reason migrating from Office 2003 to 2007 is painfull. Because the two formats are incompatible. We could have a backwards compatibilty problem. But even in this case a parser just like the one in compilers could be used to mantain a tree version of the document.

Now that this is posible then the merge/diff tools will need to operate on not arrays of text lines, but on trees of language constructs. Thisway you can have a better way to compare different code files for example not seeing that line #150 on the left file has XXXXXXXXX and the right file has YYYYYYYY on the same line, but something like: in this part of your code (maybe something like a breadcrumb could be shown) in left file there is a call to method A with a,b,c arguments and in right side just with a,b. Or maybe in left file the Method M() was called but in right file N() was called instead.

Another benefit from this kind of merge tools is that it would solve the problem with autoformatting document or with coding styles. If someone on your team use spaces but other uses tabs (4 spaces for each tab) and a crazy guy uses 3-spaced tabs and someone uses 0xOA 0x0D as newline and another one just 0x0A and someone write variable decalrations like this:

int a=0; int b=0;

Other guy like this:

int a = 0;
int b = 0;

Other crazy guy like this:

int   a   =   0;b=0             
          ;

Or differences between this:

int A() {
}

Or this:

int A()
{
}

It just would matter because the merge would treat them equally and would say that there is no change. So many people would stop just editing files just to make the format look correctly according to them. (Ctrl+K, Ctrl+D-ing all the files in Visual Studio).

I beleive this wolud reduce the number of conflicts when doing merges. And not only that. Since the merge tool know about the language (And posibly the text file is sintactically correct) then it would be posible for the merge tool to make only merges that could compile and not just mixing some characters in text based merges. If the produced merge does not compile then there is a conflict. Where you can have a better understanding of the diferences of meaning of the code not just difference in individual characters.

... After doing some research I just found this answer that appears to be something like I was trying to explain: http://stackoverflow.com/questions/523307/semantic-diff-utilities/621557#621557

Carlos Muñoz
Enjoyed reading your post. Also the link is to a question that is exactly what this one is asking. Double ++ for you.
Andrew Hubbs
@Andrew Thanks. I also enjoyed writing it since that was exactly what I was discussing with some colleagues the other day
Carlos Muñoz
Great post. That's exactly what I think and each time I suggest this idea to a fellow programmer I'm pointed at like an heretic. Did you know the language in Zoho Creator is stored in a database as a tree-representation? That's why it was so easy for them to create a port to Google Apps. Storing the language representation in some kind of database would open the door to some powerful code analysis and amazing source control.
Simon T.
...Versioning could be at the method level. Comments would be searchable metadata (plus the possibility of adding other metadata). Actually, I envision a wiki-style section for comments associated at each class and methods. Thinks also about bug tracking being integrated. Collaborative work would be much easier.
Simon T.
You don't need to store the program as a tree (yes, its OK to do that). All you need is a good parser.
Ira Baxter
A: 

I think it's asking too much.

In any system complex enough for such functionality to be useful, I'd expect to see separation of contracts from implementation using techniques such as dependency injection, so that such a tool would be pretty much impossible to create.

Joe
That's the most common fallacy I encounter when I suggest such ideas: the tool wouldn't be useful on the systems I work on or can imagine so it wouldn't be useful to anyone.
Simon T.
+2  A: 

Since nobody has outlined an off-the-shelf solution, I'll outline how to get a solution.

For a customer, we built most of what you are asking for, only the question was, "does this function have a side effect" which corresponds to "is there a local side effect, or does this function call another function with a side effect".

For that customer, what we actually did was to compute a system-wide call graph (this was for 10 million lines of C complete with nasty pointer casts on function references) as a baseline. Then, for each function, we directly labelled it as to whether it had a local side effect or not, and then propagated "has a side effect" to all callers as defined by the call graph.

Your question is almost the same; you want to know if this function has changed, semantically, or it calls a function which has changed, semantically. For this you need to compute the call graph as before. You also want a "semantic diff" routine that can tell you if a function has changed or not.

Our SD Smart Differencer is a tool that computes differences over files and can identify (for many simple cases, such as code reformatting, identifier renaming in scopes and some code rearrangements depending on the language) whether a block of code is different or not. The smart differencer, based on language-specific parsing technology, is not affected by code formatting/layout changes.

It would be pretty easy to modify the smart differencer to produce "this function is different" and with the call graph you have the basis to determine the answer.

Ira Baxter
The problem I outlined seems in fact simpler to solve than the one you worked on for your client. Do you plan to release you product to the public and have an evaluation version or is something targeted at specific clients with custom version? Your tool seems to have a lot of potential and could revolutionized the source control systems if it was to be integrated with one.
Simon T.
This isn't presently packaged as a releasable solution. The machinery on which it is based is production capable, and a packaged solution is an interesting concept. I don't see how would affect source control systems directly, as it appears to be an solution to an orthonogal problem (that of deciding what to test), which is found near every source-control system.
Ira Baxter
I see a strong link between the need I outlined and source control systems. The only tool most developers have to track change in their software is the source control. The history is of your code is really valuable but sadly source control show you the change in your code on a file basis. I believe we should do versioning at the method level. If source control could offer only that it would be a huge step forward. A version number would be attached to each method and increment automatically when the method is modified. Taking into account the change in the calls would be another step.
Simon T.
So your point is that doing deltas on the methods would be very good starting point (that is, the call graph is useful but an enhancement). ... carrying this to an extreme, why wouldnt you do version control on arbitarily small structures in source code? (blocks, statements, expressions, identifiers)?
Ira Baxter
Exactly. The evolution of source control in the last 10 years is a big joke. Remember the excitement surrounding the release of Subversion just because it fixed a few of the biggest flaws of CVS and SourceSafe. Then people got all excited when Git fixed some of the biggest flaws of Subversion. Then came Mercurial which didn't fundamentally improve that much over Git but got people excited all again, and you have systems like Darcs which try to do things that are basically impossible with the flawed paradigm of file-based source control.
Simon T.
So one of the reasons we built the SmartDifferencer was to take a step at finding changes at finer-than-file-granularity level, whether the the version control tools could handle it or not. You can download an eval copy (for Java, C#, C++, and variety of other languages) and tell me what you think of it. I think we got the fine-grain part pretty much right; the notion that one should explicitly call out methods (and perhaps higher level constructs: classes, interfaces, ...) we don't do but seem like a good idea; we have all the necessary information.
Ira Baxter
A: 

At a casual glance, this doesn't seem super difficult.

The first step is to get an AST (Abstract Syntax Tree) from the two pieces of code under consideration. The AST is typically created as part of the parsing process. Once it has an AST, the compiler walks the tree and create the lower level code.

When you have two ASTs for two pieces of code, the AST normalizes the code representation, making comparison easier.

Structural changes are readily identified.

For example:

if (a = 1) {
    print "a is 1";
}

and

if(a=1){print "a is 1";}

are represented identically in an AST.

Even

if(a=1) print "a is 1";

would likely be the same.

Identifiers make it more difficult.

Consider:

function a(b, c) {
    var x = 1;
    var y = 1;
    var z = b + c + x - y;
 }

vs

function a(arg1, arg2) {
    var first = 1;
    var second = 1;
    var result = arg1 + arg2 + first - second;
}

these are identical functions. Normalizing the identifiers should make the two identical.

But, consider:

function a(arg1, arg2) {
    var second = 1;
    var first = 1;
    var result = arg1 + arg2 + first - second;
}

This is a DIFFERENT function. To us it is, arguably, the same function. But via the AST, it's different. Only the fact they're both equal to 1 makes them the same.

So, with more advanced call graphs and constant folding, these could be considered the same.

But you can see the processes involved in doing (what I think) you want. The basics aren't that hard.

For example, the Sun Java 6 JDK provides classes that can compile java and make available an AST.

http://today.java.net/article/2008/04/09/source-code-analysis-using-java-6-apis

Will Hartung
What you suggest is going even further than what I want. Basically, I could live with the fact that a change with no impact on the behavior of the method is reported as a change (with the exception of whitespace and such). If it's an "easy" problem to solve, why don't we have such tools? I think it's because we handed the problem of tracking history and version to source control system that are agnostic to the programming language they store. The source control has the data (code history) but it's the compiler that has the capacity to do such analysis and both systems don't talk to each other.
Simon T.
You're likely right. Folks use VCS for more than just source code, and writing language analyzers for some languages are more difficult than others (C++ vs, say, Common Lisp). There may well be commercial tools that do what you want, however. But, "diff" is certainly far, far easier than an AST comparator. But if that was something you were interested in, for Java at least, it's probably pretty straight forward.
Will Hartung
Will is right. Diff *is* much easier than comparing trees, and a significant part of the problem is getting high quality parsers for a wide variety of langauges. We tackled this problem by building generic program parsing and analysis machinery first, and then built language parsers on top of that, and then built generic AST diffing tools, parameterized by language definitions.
Ira Baxter
A: 

I'm not sure if this is fulfills your requirements but this look interesting http://www.wiresong.ca/monticello/ for squeak/pharo smalltalk.

As with the rest of the Smalltalk tools I think it leans on the fact that its living inside of an image of running code and it can basically peek inside everything.

Roman A. Taycher
A: 

Simon, I am not quite whether if there is any such tool available. I assume that your main purpose should be to easily find out the broken issues between two revisions (or) versions of the same project.

So to find the broken revisions, we built an internal tool which typically pulls the source from the SVN, compiles it and executes the application with some set of predefined inputs and stores the entire execution stacktrace to the text file using the below property.

Console.WriteLine("StackTrace: '{0}'", Environment.StackTrace);

Then we compare the execution path of 2 revisions using diff tools, if there is any conflict (i.e. change in execution paths) then it is most likely the origin of the issue.

One another alternative is by taking advantage of the VS automation as explained in the below post.

http://stackoverflow.com/questions/2954079/how-to-get-stack-trace-of-a-running-process-from-within-a-visual-studio-add-in

But in both the case, the approach is still the same, but you could get much more details of the execution paths by following 2nd approach.

Hope this helps,

Karthik
A: 

I don't think there is a tool doing this out-of-the-box.

Actually, you need three things:

  1. Generate a (machine-readable) call graph for your code. This SO thread is probably a good place to start from.
  2. Find the changed methods. Here is a way to do it in SVN. CVS, it seems, has '-p' switch that does this.
  3. A script/tool to put the things together: generate a call graph, call diff, mark the changed methods and all its callers. I'd suppose, this one you have to create by yourself.

A somewhat tricky part could be, what happens if the call graph changes between versions. Probably, taking into consideration only the "new version" of call graph is enough for your purposes.

VladV