Hi all

Given a code base (say, a large C or Objective-C project), I would like to analyze the source files and pick out symbols of interest. They might be class declarations, variable names or types, or method names. Is there a Python module that could help me with this?

The only approach I can see is to use regular expressions to gather these symbols, but I suspect this could get very ugly very quickly. I'm also not an expert in compilers or parsers, so something lighter-weight would be preferable.

Thanks for any suggestions.

------ update -----

Thanks for all of the suggestions so far; there are definitely some promising leads. One other avenue that may be possible: what if I were able to compile the project I am trying to analyze? Would the debugging symbols (dSYM) make this process any easier? I'm not looking for anything advanced, just a list of classes with their ivar and method names. At this point, looking into the suggested parsing tools seems like more work than I can afford to invest in this project right now.

+5  A: 

Regex is definitely not a good way to examine programming-language code. I would suggest choosing a parsing module from the links provided below. There are a few tools you could use, all of which provide a parsing facility that you can build your own analysis on top of:

pygccxml generates an XML description from C++ program files. This might be closer to what you are trying to do:

Also look at this; it generates a navigable class tree representing the class structure.

pyfunc
+1  A: 

Regular expressions are not the way to go here. The languages already have a defined grammar, so use that.
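
To make the problem concrete, here is a minimal sketch of how a naive pattern goes wrong. The C snippet, the pattern, and the function name `naive_function_names` are all made up for illustration; the point is only that a regex has no notion of comments or string literals, which a grammar-based parser handles for free.

```python
import re

# Hypothetical C snippet, used only for illustration.
SOURCE = '''
/* int old_api(int x); was removed in v2 */
int real_function(int x);
const char *msg = "int fake_in_string(int y);";
'''

# Naive pattern: the word "int", a name, then an opening parenthesis.
NAIVE_DECL = re.compile(r'\bint\s+(\w+)\s*\(')


def naive_function_names(source):
    """Return every name the naive pattern matches, ignoring context."""
    return NAIVE_DECL.findall(source)


names = naive_function_names(SOURCE)
print(names)  # ['old_api', 'real_function', 'fake_in_string']
```

Only `real_function` is an actual declaration here. The other two hits come from a comment and a string literal, and that is before preprocessor macros, multi-line declarations, or Objective-C method syntax enter the picture.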

Daenyth
+1  A: 

Our Search Engine has a facility for picking out all identifiers using the language structure (it specifically handles C at this point, but not Objective-C). The Search Engine provides an interactive query language for searching for various language constructs, displaying hits, and displaying the source text that matches them. We are about to release a version that finds definitions and uses, which would pick out function, type, and variable declarations directly. This would be considered "lightweight".

Related is the Search Engine's big brother, the DMS Software Reengineering Toolkit. DMS with its C Front End provides the ability to fully parse C code and find arbitrary symbol definitions. This would be considered "heavy duty" in that it has a full preprocessor and gets the definition information absolutely right, as well as providing complete access to the AST associated with the symbol name (declaration, function, typedef, ...).

These are not Python modules, but they do provide precise access to the kind of information likely to be of interest.

Ira Baxter