views:

1436

answers:

11

I'm a self taught programmer with almost non CS background. I'm currently learning about parsing techniques/algorithms/tools, and have a desire to build programs to analyze Java code.

What kind of mathematical/theoretical CS do you have to know or take into account to build the most basic static analysis tools? A very simple thing that comes to my mind is a tool for detecting where (lines of code) and when (conditions) some kind of variable is being written to.

As far as I know, knowing how to write parsers/compilers is the faily minimum thing you must grok... Any books/online courses you'd recommend?


UPDATE: Very interesting answers so far. For the benefit of the community, I'll keep a short summary with links to the tools/topics and books you have recommended:

CS Topics/Themes

Essential books

Advanced books

Parsing/Analysis Tools and Frameworks

+8  A: 

While this does not directly address your question, you should probably take a look at FindBugs. I've been using it for a few years and it's awesome. Also have a look at PMD/CPD. These are both open source and you can learn a great deal from them.

Limbic System
+1 I'd recommend findbugs as well.
Mark Davidson
Findbugs, and PMD/CPD are excellent tools. I would also suggest taking a look at the JSR305 annotations (http://findbugs.sourceforge.net/manual/annotations.html) and the Java Concurrency in Practice annotations (http://jcip.net/), which Findbugs can use to produce more accurate results.
Andrew Newdigate
+1  A: 

I'd recommend a Java LL(*) parser generator called ANTRL. It's an excellent tool, and there's a terrific book available to show you how to use it. There's a Java grammar that you can use right out of the gate, leaving you to write the analysis code you want once you have the AST in hand.

duffymo
Just a small mistake, but it's ANTLR, not ANTRL.
dancavallaro
Note that if you have used YACC/BISON style parser generators then don't expect much beyond BNF to translate directly to ALTLR. It's set up for an almost totally different approach to the whole problem so do a lot of reading before you dive in much.
BCS
Just noticed the spelling error. Thanks.
duffymo
Don't expect even a YACC BNF to translate directly to ANTLR. YACC is LALR and left recursion rules are easy to handle. ANTLR is LL and can't handle left recursion. This isn't an objection to ANTLR, just an observation that switching parsing engines isn't all that easy.
Ira Baxter
Thank you Ira, great comment.
duffymo
+2  A: 

I recommend the Dragon Book on compilers:

http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811/ref=dp_ob_title_bk

BobbyShaftoe
I'm told that the Dragon Book, while a classic, is largely passe these days. The optimizations they recommended and far more are done at runtime by JIT compilers.
duffymo
Since the question is about static analysis not optimizing bytecode generation, the Dragon book is still as relevant as ever because an analyzer must at least be smart about the AST.
Alan
JIT compilers are compilers so optimizations are still relevant, just not where they used to be.
BCS
Passe? That's interesting. If you want to do any kind of static analysis, I think the Dragon Book is required reading. Now, perhaps you need to read more but the question was about how he could gain the necessary CS background and so forth.
BobbyShaftoe
A: 

Take a look at abstract interpretation.

starblue
+7  A: 

I don't know of any good book that is primarily about program analysis. It's too bad; the topic is interesting and important. There are lots of tools and research papers out there, but probably hard to get started for a beginner. However, you can probably learn some useful things from some of the more recent compiler books:

And there's at least one very powerful tool you should learn about:

  • If you want to analyze Java code, stand on the shoulders of Andrew Myers: take a look at the Polyglot framework for analyzing Java code.
Norman Ramsey
The Polyglot framework looks great. Any experience using it?
namin
+3  A: 

As someone else mentioned, the dragon book is a good place to start.

If you want to get more indepth, then I reccomend the following 2 books, in the order listed:

  1. Types and Programming Lanauges, by Benjamin C. Pierce
  2. Principals of program analysis, by Flemming Nielson, Hanne Riis Nielson, and Chris Hankin

Now, these books are a bit heavy on theory, which might put some people off, but I think they are worth your time.

There is a little bit of danger in taking a theoretical approach, mainly that you can get caught up in the "rigor" of the mathematical modeling, and end up forgetting about the end goal in the first place (which is creating development tools for programmers). Some people kind of go off the deep end as a result. If you keep that in mind when you are going through the books it shouldn't be a problem though.

The reason I recommend taking a theoretical approach is that working through the proofs exercises in the Pierce book trains your brain for analyzing programing language semantics in the way that you need to when working on a real programing language. Doing the proofs basically boils down to thinking through the different possible combinations of things in the type system and how they interact with each other, and then reasoning about them.

Once you do it a few times, it starts to become intuitive. Going through that process quickly and intuitively is invaluable when working on a compilers, defining programing languages, or writing development tools.

The Nielson, Nielson, and Hankin book is a HARD read. I had the benefit of a professor, a class of students, and 2 lectures a week to help me, and it was tough. Going through it by your self won't be easy. However, it really covers the foundations of program analysis throughly, which would be really helpful if you want to start doing the kinds of analysis you are talking about.

As far as math knowledge goes, you need to be familiar with "discrete math" concepts. This includes things like "partial orders", "lattices", and basic set theory. Any CS "discrete math course" or a "modern algebra class" from the Math dept should give you the background you need.

Scott Wisniewski
Very thoughtful answer. Ah, the book from Benjamin Pierce seems a worthy read, I've received recommendations of it more than once before... :)
Camilo Díaz
I'm glad somebody appreciates it....The answer that said "I don't know if there are any books about program analysis" got more upvotes
Scott Wisniewski
@scott: The answer said "I don't know of any good book...", which is true. The Pierce book isn't really a static analysis book, and Principles of Program Analysis I would not describe as a "good book".
Paul Biggar
+3  A: 

There are two main approaches to stuff like this: source analysis and bytecode analysis. (Maybe the third non-static approach is dynamic runtime analysis.)

For source code analysis, you'll want to dive into parsers and abstract syntax trees. Your goal would be to break down the source into an AST and then walk over the AST computing things. As already mentioned, ANTLR is an excellent tool for that. Some other Java tools are JavaCC, and probably Jackpot is an interesting thing to look at. Popular open source static analysis source tools are things like PMD and Checkstyle if you want to look at examples.

For byte code analysis, you'll instead want to break apart the class structure and examine streams of bytecodes. For this, the best library to use is ASM. You can create visitors with ASM that walk over the instructions and other aspects of the class file. At this level you'd need to detect the bytecodes for when variables are written and so. ASM has a bunch of tools to help you with common tasks as well. Easily the best open source bytecode analysis tool is FindBugs, which has all the infrastructure already set up if you want to write your own "bug" detectors with it (in ASM or BCEL).

Coverity is a commercial company that makes some interesting dynamic or mixed static/dynamic tools for things like data race and deadlock detectors.

Alex Miller
+3  A: 

I highly recommend The Purple Dragon Book. I used it for my compilers class in college, and it really does cover most of things in a 'traditional' compiler, with the recent additions talking a bit about JITing and garbage collection. It has great pseudo-code for every algorithm, and gives thorough explanations of the theory. I wouldn't call it extremely easy to pick up, but compiler design is not a simple subject. I also recommend looking into tools like bison/flex; they can be extremely useful for generating parsers and scanners. You'll want to look into live-variable analysis for what you were talking about.

Mongoose
+6  A: 

I'm afraid there isn't a good introductory text on the theory of static analysis for bug finding. As several posters have already noted, you should read the Dragon Book to learn about compilers before you start to learn about static analysis, but that's a prerequisite, not an introduction. Somewhat similar considerations apply to Nielson’s Principles of Program Analysis: it's a good academic textbook on several forms of program analysis, but I couldn't find much relevance to static analysis for bug finding in the parts I got through. There’s a book, Secure Programming with Static Analysis, by the founders of Fortify, but it focusses on security and their tool, not on the general theory of static analysis. A couple of prominent researchers in the field (Alex Aiken of Stanford and Tom Ball of Microsoft) are rumored to be working on textbooks, but nothing's available yet. [Later: According to Professor Engler, Professor Aiken’s course on software engineering (which Engler once taught, oddly enough) covers static analysis; the syllabus and slides are available at http://www.stanford.edu/class/cs295/.]

The best introduction I’ve found is Dawson Engler’s academic articles, available on his Stanford web page, http://stanford.edu/~engler/, starting with “Bugs as Deviant Behavior.” Professor Engler is one of the founders of Coverity, but the interest of his work isn’t limited to Coverity users. Indeed, there’s a lot of important stuff (e.g. defect ranking) in his academic work which is not reflected in Coverity’s products.

I’ve added a few recent bibliographies to the Wikipedia article on Static code analysis; the most useful is probably Livshits’ Stanford thesis. I also list the International Static Analysis Symposium proceedings there, but I’ve found them disappointing for practical applications.

Flash Sheridan
The Livshits thesis is pretty interesting.
Ira Baxter
+1  A: 

I'm surprised no one has mentioned it yet, but for someone with a non-CS background I wouldn't recommend some of the compiler books above -- at least to get started.

Instead I'd recommend Secure Programming with Static Analysis.

Don't worry about the "secure" part in the title. The authors of the book created a commercial static analyzer whose primary interest was in finding security defects, but the book actually has very good introduction to static analysis in general.

Garen
A: 

For tool foundations, what you want are parsing engines, customizable analysis frameworks, and mature langauge definitions, to enable you to build a static analysis tool that solves a specific problem.

Our DMS Software Reengineering Toolkit is a set of generic analysis and transformation tools used to build custom static or dynamic analyzers, and/or code transformation tools, parameterized by explicit langauge definitions. DMS is an interesting study in how to provide frameworks for general program manipulation.

DMS provides built-in support machinery for parsing, syntax tree building, symbol table construction, control and dataflow analysis, local and global points-to analysis, call graph construction and symbolic range analysis. It provides source-to-source transformation in terms of the surface syntax of the langauge definitions provided. Transformed ASTs can be regenerated back to compilable text, complete with original comments. The transformational aspect can be used to implement dynamic analysis.

There are mature langauge definitions (used in anger) for C, Java and COBOL handling multiple dialects, all tied to all the analysis machinery above. There is a C++ front end with full symbol tables (derived by Keonig lookup) and type analysis for expressions.

Ira Baxter