ansaurus

Question

Answer 1

+1 A:

I think your question is flawed. The title says "Static source code analysis". Yet your underlying reason appears to be the construction of (part of ) a call graph including calls through a function pointer. The essence of function pointers is that you cannot know their values at compile time, i.e. at the point where you do static source code analysis. Consider this bit of code:

void (*pFoo)() = GetFoo();
pFoo();

Static code analysis cannot tell you what GetFoo() returns at runtime, although it might tell you that the result is subsequently used for a function call.

Now, what values could GetFoo() possibly return? You simply can't say this in general (equivalent to solving the halting problem). You will be able to guess some trivial cases. The guessable percentage will of course go up depending on how much effort you are willing to invest.

MSalters 2010-03-01 13:23:32

I know about this, I plan to implement it by myself if none is available. I would like to create a slicing program based on LLVM but I dont know if first LLVM meet my need, and second how to start programming my "LLVM extension".

Phong 2010-03-01 13:33:03

Maybe stackoverflow is not the best place to ask. I will try the LLVM forum directly. Thanks for your help.

Phong 2010-03-01 13:34:20

Theory only says that an analyzer cannot return the exact set of values possibly returned by `GetFoo()` and terminate for all input programs. It does not say that it's impossible for the analyzer to terminate in all cases as long as you are willing to include in the set values that are not actually taken, or to exclude some values that would have been taken. It certainly does not say that it's impossible to return accurate sets for the programs that people are interested in in practice.

Pascal Cuoq 2010-03-01 13:39:21

@Pascal: I don't need an analyzer to determine a superset of possible return values; the set of all (correctly typed) functions in my program is such a superset - in the absence of `dlopen()` of course. And that's the _real_ killer.

MSalters 2010-03-01 13:47:48

@MSalters I'm not sure what you mean by "*real* killer". If there was any interest at all in the precise analysis of programs with dlopen(), it would already be done. It's not as if dlopen was doing something more complicated than what, say, http://www.astree.ens.fr/ already analyzes in codes for which there is an interest.

Pascal Cuoq 2010-03-01 13:58:55

@Pascal: the stated goal of Phong is to determine the callee (i.e. target of a function pointer) with static (source code) analysis. Now consider the common case where that pointer is obtained from a `dlopen()` ed library. This means that the set of candidate functions is even larger than the set of functions in the main executable. You may not even statically know which libraries are loaded at all (i.e. plugin systems that `dlopen()` anything found in a particular directory)

MSalters 2010-03-01 16:02:01

I'm confused. We were talking about the halting problem and now you're basically saying that it's impossible to analyze source code that is not available. Is the halting problem about telling if a program that you don't have terminates? That's not the way I learned it.

Pascal Cuoq 2010-03-01 19:47:18

Thanks for this passionate debate. With my analysis I actually want to extract caller callee function from the available source code (this does not include function in a library (except if the source code is available).

Phong 2010-03-02 00:05:49

Answer 2

+3 A:

I think that Clang (the analyzer that is part of LLVM) is geared towards the detection of bugs, which means that the analyzer tries to compute possible values of some expressions (to reduce false positives) but it sometimes gives up (in this case, emitting no alarm to avoid a deluge of false positives).

If your program is C only, I recommend you take a look at the Value Analysis in Frama-C. It computes supersets of possible values for any l-value at each point of the program, under some hypotheses that are explained at length here. Complexity in the analyzed program only means that the returned supersets are more approximated, but they still contain all the possible run-time values (as long as you remain within the aforementioned hypotheses).

EDIT: if you are interested in possible values of function pointers for the purpose of slicing the analyzed program, you should definitely take a look at the existing dependencies and slicing computations in Frama-C. The website doesn't have any nice example for slicing, here is one from a discussion on the mailing-list

Pascal Cuoq 2010-03-01 13:32:46

Thanks for the information

Phong 2010-03-04 01:35:47

Answer 3

+1 A:

In our project, we perform static source code analysis by converting LLVM bytecode into C code with help of llc program that is shipped with LLVM. Then we analyze C code with CIL (C Intermediate Language), but for C language a lot of tools is available. The pitfail that the code generated by llc is AWFUL and suffers from a great loss of precision. But still, it's one way to go.

Edit: in fact, I wouldn't recommend anyone to o like this. But still, just for a record...

Pavel Shved 2010-03-01 16:57:39

How could you convert LLVM bytecode into C with `llc`? I'm not familiar with LLVM, but based on the man page of `llc`, it compiles into assembly language?

Wei Hu 2010-03-28 01:22:48

@WeiHu, it actually does compile into assembler, but it also can write that code in C. It has a special "architecture" turned on with option `-march=c`.

Pavel Shved 2010-03-28 07:21:53

Interesting feature! I wanted to upvote but got "Vote too old to be changed" :(

Wei Hu 2010-03-28 08:12:04

Answer 4

+2 A:

You should take a look at Elsa. It is relatively easy to extend and lets you parse an AST fairly easily. It handles all of the parsing, lexing and AST generation and then lets you traverse the tree using the Visitor pattern.

class CallGraphGenerator : public ASTVisitor
{
  //...
  virtual bool visitFunction(Function *func);
  virtual bool visitExpression(Expression *expr);
}

You can then detect function declarations, and probably detect function pointer usage. Finally you could check the function pointers' declarations and generate a list of the declared functions that could have been called using that pointer.

Gabe 2010-03-03 18:48:42

I will have a look on it. From the example you provided, looks really easy to use.

Phong 2010-03-04 02:02:19

Traversing a single parse tree is a long way from collecting points-to data across an entire system of files and computing a call graph using the results of a global points-to analysis.

Ira Baxter 2010-03-11 05:35:34

Answer 5

+1 A:

The DMS Software Reengineering Toolkit provides various types of control, data flow, and global points-to analyzers for large systems of C code, and constructs call graphs using that global points-to analysis (with the appropriate conservative assumptions). More discussion and examples of the analyses can be found at the web site.

DMS has been tested on monolithic systems of C code with 25 million lines. (The call graph for this monster had 250,000 functions in it).

Engineering all this machinery from basic C ASTs and symbol tables is a huge amount of work; been there, done that. You don't want to do this yourself if you have something else to do with your life, like implement other applications.

Ira Baxter 2010-03-11 05:42:57

Thanks for the information. But this tool is not a freeware, so I can not use it in my application.

Phong 2010-03-11 06:45:36

ansaurus

tags:

views:

answers:

Static source code analysis with LLVM

related questions