I am trying to write a program to check that some C source code conforms to a variable naming convention. In order to do this, I need to analyse the source code and identify the type of all the local and global variables.

The end result will almost certainly be a Python program, but the tool to analyse the code could either be a Python module or an application that produces an easy-to-parse report. Alternatively (more on this below) it could be a way of extracting information from the compiler (by way of a report or similar). In case it helps, the compiler will in all likelihood be the Keil ARM compiler.

I've been experimenting with ctags and this is very useful for finding all of the typedefs and macro definitions etc, but it doesn't provide a direct way to find the type of variables, especially when the definition is spread over multiple lines (which I hope it won't be!).

Examples might include:

static volatile u8 var1; // should be flagged as static and volatile and a u8 (typedef of unsigned 8-bit integer)
volatile   /* comments */   
    static /* inserted just to make life */
        u8 /* difficult! */   var2 =
        (u8) 72
           ; // likewise (nasty syntax, but technically valid C)
const uint_16t *pointer1;  // flagged as a pointer to a constant uint_16t
int * const pointer2; // flagged as a constant pointer to an int
const char * const pointer3; // flagged as a constant pointer to a constant char
static MyTypedefTYPE var3; // flagged as a MyTypedefTYPE variable
u8 var4, var5, var6 = 72;
int *array1[SOME_LENGTH]; // flagged as an array of pointers to integers
char array2[FIRST_DIM][72]; // flagged as an array of arrays of type char

etc etc etc

It will also need to identify whether they're local or global/file-scope variables (which ctags can do) and if they're local, I'd ideally like the name of the function that they're declared within.

Also, I'd like to do a similar thing with functions: identify the return type, whether they're static and the type and name of all of their arguments.

Unfortunately, this is rather difficult with C syntax, since there is a certain amount of flexibility in the order of the keywords and a lot of flexibility in the amount of white space allowed between them. I've toyed with using some fancy regular expressions to do the work, but that's far from ideal: there are so many different cases to cover that the regular expressions quickly become unmanageable.

I can't help but think that compilers must be able to do this (in order to work!), so I was wondering whether it is possible to extract this information from one. The Keil compiler seems to produce a ".crf" file for each source file that's compiled, and this appears to contain all of the variables declared in that file, but it's a binary format and I can't find any information on how to parse it. Alternatively, a way of getting the information out of ctags would be perfect.

Any help that anyone can offer with this would be greatly appreciated.

Thanks,

Al

+3  A: 

There are a number of Python parser packages that let you describe a syntax and then generate Python code to parse that syntax.

Ned Batchelder wrote a very nice summary of them.

Of those, Ply was used in a project called pycparser that parses C source code. I would recommend starting with this.

Some of those other parser projects might also have sample C parsers.

Edit: just noticed that pycparser even has a sample Python script to just parse C type declarations like the old cdecl program.
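To give a feel for what a cdecl-style description involves, here is a small hand-rolled sketch using only the standard library. It is not pycparser's script and does not use pycparser's API; the function name `describe` is made up for this example. It assumes the input is a single declaration with comments already removed, and it only understands qualifiers, pointers and array suffixes (no function pointers or parenthesised declarators).

```python
import re


def describe(decl):
    """Describe a simple C declaration in words, cdecl-style."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[*\[\];]", decl)
    # The declared name is the identifier just before the first '[' or ';'.
    end = next((i for i, t in enumerate(tokens) if t in ("[", ";")), len(tokens))
    name = tokens[end - 1]
    prefix, suffix = tokens[:end - 1], tokens[end:]

    parts = []
    # Array suffixes bind tightest, so they are read first, left to right.
    i = 0
    while i < len(suffix) and suffix[i] == "[":
        close = suffix.index("]", i)
        parts.append("array[%s] of" % " ".join(suffix[i + 1:close]))
        i = close + 1

    # Everything before the first '*' is the base type (with its keywords).
    star = prefix.index("*") if "*" in prefix else len(prefix)
    base, pointers = prefix[:star], prefix[star:]
    # Group each '*' with the qualifiers that follow it, then read the
    # groups right to left, i.e. nearest to the name first.
    groups = []
    for tok in pointers:
        if tok == "*":
            groups.append([])
        else:
            groups[-1].append(tok)
    for quals in reversed(groups):
        parts.append(" ".join(quals + ["pointer to"]))

    parts.append(" ".join(base))
    return "%s: %s" % (name, " ".join(parts))


for line in ("int * const pointer2;",
             "const char * const pointer3;",
             "int *array1[SOME_LENGTH];"):
    print(describe(line))
```

For the three declarations above this prints "pointer2: const pointer to int", "pointer3: const pointer to const char" and "array1: array[SOME_LENGTH] of pointer to int". A real parser such as pycparser is still the better choice once typedefs, structs or function pointers are involved.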

Van Gale
+1: This has already been done -- several times over.
S.Lott
A: 

I did something similar for a project I was working on a few years ago. I ended up writing the first half of a C compiler. Don't be alarmed by that prospect. It is actually much easier than it sounds, especially if you are only looking for certain tokens (variable definitions, in this case).

Look for documentation online about how to scan C source code, detect tokens of interest, and parse the results. A good place to start is Wikipedia's article on lexical analysis.
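As a rough illustration of the scanning step, the sketch below uses only Python's `re` module to drop comments and white space, split a declaration into tokens, and sort the tokens into storage class, qualifiers, type and name. The names `tokenize` and `classify` are invented for this example. It assumes preprocessed input holding one simple declaration (no pointers, arrays or comma-separated lists), so it is a starting point and not a real C front end.

```python
import re

# One alternative per token class; comments come first so that their
# contents are never mistaken for identifiers.
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)   # block or line comment
  | (?P<ident>[A-Za-z_]\w*)           # keyword, typedef name or variable
  | (?P<number>\d\w*)                 # integer literal (simplified)
  | (?P<punct>[*\[\](),;=])           # punctuation used in declarations
  | (?P<space>\s+)                    # white space, including newlines
""", re.VERBOSE | re.DOTALL)

STORAGE = {"static", "extern", "register", "auto", "typedef"}
QUALIFIERS = {"const", "volatile"}


def tokenize(source):
    """Return the significant tokens, dropping comments and white space."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError("unexpected character %r at offset %d"
                             % (source[pos], pos))
        if match.lastgroup not in ("comment", "space"):
            tokens.append(match.group())
        pos = match.end()
    return tokens


def classify(tokens):
    """Split a simple declaration into storage class, qualifiers, type, name."""
    # Ignore any initializer: keep only what precedes '=' or ';'.
    head = []
    for tok in tokens:
        if tok in ("=", ";"):
            break
        head.append(tok)
    rest = [t for t in head if t not in STORAGE and t not in QUALIFIERS]
    return {
        "storage": [t for t in head if t in STORAGE],
        "qualifiers": [t for t in head if t in QUALIFIERS],
        "type": " ".join(rest[:-1]),  # everything except the final identifier
        "name": rest[-1],             # the last identifier is the variable
    }


sample = """volatile   /* comments */
    static /* inserted just to make life */
        u8 /* difficult! */   var2 =
        (u8) 72
           ; // likewise"""
print(classify(tokenize(sample)))
```

Because comments and white space are discarded before any classification happens, the awkward multi-line `var2` declaration from the question ends up as the same token stream as a tidy one-line version.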

e.James
+2  A: 

Check out ANTLR. It's a parser generator, with bindings for Python. The ANTLR site provides a whole bunch of grammars for common languages, C included. You could download the grammar for C and add actions in appropriate places to collect the information you're interested in. There's even a neat graphical tool for creating and debugging the grammars. (I know that seems hokey, but it's actually quite handy and not obnoxious.)

I just did something sort of similar, except to get my symbol information I'm actually extracting it from GDB.

Ryan
+2  A: 

What you're trying to do is a lightweight form of static analysis. You might have some luck looking at the tools pointed to by Wikipedia.

Parsing the C code yourself sounds like the wrong direction to me: therein lies madness. If you insist, then [f]lex and yacc (bison) are the tools likely used by your compiler-writers.

Or, if ctags or cscope gets you 80% of the way, the source code to both is widely available. The last 20% is a Simple Matter of Programming. :)
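If ctags is the starting point, its output is easy to read from Python. The sketch below parses lines in the extended tags file format (name, file, search pattern, then tab-separated extension fields). The function name `parse_tag_line` and the two sample lines are hand-written for illustration; which kinds and fields actually appear depends on the ctags version and the options it was run with, so treat the sample data as an assumption.

```python
def parse_tag_line(line):
    """Parse one line of an extended-format tags file into a dict."""
    line = line.rstrip("\n")
    if line.startswith("!_TAG_"):
        return None  # pseudo-tag header line, not a real symbol
    # The search pattern may itself contain tabs, so split off the
    # extension fields at the ';"' marker before splitting on tabs.
    head, sep, ext = line.partition(';"\t')
    name, path, address = head.split("\t", 2)
    tag = {"name": name, "path": path, "address": address}
    for field in (ext.split("\t") if sep else []):
        if not field:
            continue
        key, colon, value = field.partition(":")
        if colon:
            tag[key] = value      # e.g. 'file:' marks a file-local symbol
        else:
            tag["kind"] = field   # a bare field is the kind letter
    return tag


# Hand-written sample lines (assumed, not captured from a real run).
SAMPLE = [
    'var1\tmain.c\t/^static volatile u8 var1;$/;"\tv\tfile:',
    'count\tmain.c\t/^    int count = 0;$/;"\tl',
]

for raw in SAMPLE:
    tag = parse_tag_line(raw)
    scope = "local" if tag.get("kind") == "l" else "global/file-scope"
    print(tag["name"], scope, tag["address"])
```

The search pattern in the `address` entry holds the source line of the declaration, which can then be handed to whatever code works out the type.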

+2  A: 

How about approaching it from the other side completely. You already have a parser that fully understands all of the nuances of the C type system: the compiler itself. So, compile the project with full debug support, and go spelunking in the debug data.

For a system based on formats supported by binutils, most of the detail you need can be learned with the BFD library.

Microsoft's debug formats are (somewhat) supported by libraries and documents at MSDN, but my Google-fu is weak today and I'm not putting my hands on the articles I know exist to link here.

The Keil 8051 compiler (I haven't used their ARM compiler here) uses Intel OMF or OMF2 format, and documents that the debug symbols are for their debugger or "any Intel-compatible emulators". Specs for OMF as used by Keil C51 are available from Keil, so I would imagine that similar specs are available for their other compilers too.

A quick scan of Keil's web site seems to indicate that they abandoned their proprietary ARM compiler in favor of licensing ARM's RealView Compiler, which appears to use ELF objects with DWARF format debug info. DWARF should be supported by BFD, and should give you everything you need to know to verify that the types and names match.

RBerteig
+1 Fully agree. I took the same approach but used the libdwarf library, which supports ELF with DWARF format. It works like a charm.
qrdl