I'm looking for a way to parse C++ code to retrieve some basic information about classes. I don't actually need much information from the code itself, but I do need it to handle things like macros and templates. In short, I want to extract the "structure" of the code: what you would show in a UML diagram.

For each class/struct/union/enum/typedef in the code base, all I need (after templates & macros have been handled) is:

  • Their name
  • The namespace in which they live
  • The fields contained within (name of type, name of field and access restrictions, such as private/mutable/etc)
  • Functions contained within (return type, name, parameters)
  • The declaring file
  • Line/column numbers (or byte offset in file) where the definition of this data begins

The actual instructions in the code are irrelevant for my purposes.

I'm anticipating a lot of people saying I should just use a regex for this (or even Flex & Bison), but these aren't really viable options, as I do need the preprocessor and template stuff handled properly.

+3  A: 

Running Doxygen on the code would give you most of that, wouldn't it?

In what format do you want the output?

therefromhere
From what I've seen, Doxygen can really only output human-friendly data, and parsing that might be a bit much for a program. I just need it in a format where I can easily access that info through code.
Grant Peters
Doxygen can also output XML: http://www.doxygen.nl/config.html#cfg_generate_xml
Éric Malenfant
A: 

You can easily get macros expanded by just running the preprocessor (cpp) on the source. The templates are not that easy, since template instantiation happens much later.
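
As a rough illustration of the preprocessor route, the sketch below runs a compiler in preprocess-only mode and then uses the line markers in the output to map each expanded line back to its original file and line. It assumes GCC-style markers (`# 12 "file.h"`) or `#line 12 "file.h"` directives; the helper names (`run_preprocessor`, `map_origins`) and the sample text are made up for the example, and other toolchains may emit something different.

```python
import re
import subprocess

# Matches GCC-style markers (# 12 "file.h" 1) and #line directives.
LINEMARKER = re.compile(r'^#\s*(?:line\s+)?(\d+)\s+"([^"]*)"')

def run_preprocessor(path, compiler="g++"):
    """Return the preprocessed text of `path` (assumes `compiler` accepts -E)."""
    result = subprocess.run([compiler, "-E", path],
                            capture_output=True, text=True, check=True)
    return result.stdout

def map_origins(preprocessed_text):
    """Return (file, line, text) for every non-marker line of the output."""
    origins = []
    current_file, current_line = None, 0
    for text in preprocessed_text.splitlines():
        marker = LINEMARKER.match(text)
        if marker:
            # A marker gives the origin of the line that follows it.
            current_line = int(marker.group(1))
            current_file = marker.group(2)
            continue
        origins.append((current_file, current_line, text))
        current_line += 1
    return origins

# Hand-written sample imitating preprocessor output (not real tool output).
SAMPLE = """\
# 1 "main.cpp"
# 1 "point.h" 1
struct Point {
  int x;
};
# 2 "main.cpp" 2
Point origin;
"""

for origin in map_origins(SAMPLE):
    print(origin)
```

This only recovers where each expanded line came from; something else still has to parse the declarations, and templates are untouched at this stage.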

Nikolai N Fetissov
This was one option I had in the back of my mind, but I've recently had some dealings with the preprocessors from 2 completely different providers (one Sony, the other MS), and the output they give is actually different from what is used internally (the MS one MIGHT just be a whitespace error, but it does cause errors that prevent the pre-processed file from being created). Plus, I still need something to parse the code.
Grant Peters
+4  A: 

Sounds like a job for gcc-xml in combination with the C++ XML library or XML-friendly scripting language of your choice.
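
A minimal sketch of the "XML-friendly scripting language" half, in Python. The sample document is hand-written to imitate the general shape of GCC-XML output (elements cross-referenced by `id`, with `context`, `members`, `type` and `file` attributes); real output is much larger and contains many more element kinds, so treat the element and attribute handling here as an assumption to check against your own files. `summarize_gccxml` is a made-up name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating GCC-XML output (not real tool output).
SAMPLE = """\
<GCC_XML>
  <Namespace id="_1" name="::" members="_2"/>
  <Namespace id="_2" name="geo" context="_1" members="_3"/>
  <Class id="_3" name="Point" context="_2" file="f0" line="3" members="_4 _5"/>
  <Field id="_4" name="x" type="_6" context="_3" access="private" file="f0" line="5"/>
  <Method id="_5" name="scale" returns="_7" context="_3" access="public" file="f0" line="7">
    <Argument name="factor" type="_6"/>
  </Method>
  <FundamentalType id="_6" name="int"/>
  <FundamentalType id="_7" name="void"/>
  <File id="f0" name="point.h"/>
</GCC_XML>
"""

def summarize_gccxml(xml_text):
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root if el.get("id")}

    def name_of(ref):
        # Pointer/reference/cv-qualified types carry no name of their own;
        # this sketch does not follow them and just reports "?".
        el = by_id.get(ref)
        return el.get("name", "?") if el is not None else "?"

    def namespace_of(el):
        # Walk the context chain up to the global namespace "::".
        parts = []
        ctx = by_id.get(el.get("context"))
        while ctx is not None and ctx.get("name") != "::":
            parts.append(ctx.get("name"))
            ctx = by_id.get(ctx.get("context"))
        return "::".join(reversed(parts))

    classes = []
    for cls in root:
        if cls.tag not in ("Class", "Struct", "Union"):
            continue
        info = {"name": cls.get("name"), "namespace": namespace_of(cls),
                "file": name_of(cls.get("file")), "line": int(cls.get("line")),
                "fields": [], "methods": []}
        for ref in cls.get("members", "").split():
            member = by_id.get(ref)
            if member is None:
                continue
            if member.tag == "Field":
                info["fields"].append((name_of(member.get("type")),
                                       member.get("name"), member.get("access")))
            elif member.tag == "Method":
                params = [(name_of(a.get("type")), a.get("name"))
                          for a in member.findall("Argument")]
                info["methods"].append((name_of(member.get("returns")),
                                        member.get("name"), params))
        classes.append(info)
    return classes

print(summarize_gccxml(SAMPLE))
```

Resolving everything through the `id` table is the main design point: names, namespaces, types and files are all references rather than inline text.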

Georg Fritzsche
From the description on the page you linked, this sounds like it's exactly what I'll need.
Grant Peters
Cool, the FAQ also says that instantiated templates are logged, so this sounds perfect (the only thing it seems to lack is the "function bodies", which is the one thing I really don't need at all).
Grant Peters
Good to hear. I think it's convenient for simpler cases.
Georg Fritzsche
So you parse C++, emit XML and then parse XML... I'm impressed. (Hint: rampant sarcasm).
MaD70
If you have too much money, you can buy front ends from the Edison Design Group. If you have too much time, you can hack up gcc to give you what you need. Or you can implement a complete C++ parser yourself. Have fun with that. (Hint: irony intended)
Georg Fritzsche
+1  A: 

Exuberant Ctags will give you most of what you need; it's usually used by editors to provide code navigation.
It may choke on some templates though...
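
For a sense of what consuming ctags output looks like, here is a small Python sketch that reads a tags file in the extended format. It assumes the file was generated with something like `ctags --excmd=number --fields=+aS --c++-kinds=+p -R .` so that the address is a line number and the access and signature fields are present; the sample lines are hand-written for the example, and `parse_tags` is a made-up name.

```python
def parse_tags(text):
    """Parse extended-format tags lines into dictionaries."""
    entries = []
    for line in text.splitlines():
        if not line or line.startswith("!_TAG_"):
            continue  # skip blank lines and the metadata header
        # Extended format: name<TAB>file<TAB>address;"<TAB>field<TAB>field...
        head, sep, extension = line.partition(';"\t')
        name, path, address = head.split("\t", 2)
        entry = {"name": name, "file": path, "kind": None,
                 "line": int(address) if address.isdigit() else None}
        for field in (extension.split("\t") if sep else []):
            key, colon, value = field.partition(":")
            if colon:
                entry[key] = value      # e.g. class:geo::Point, access:private
            else:
                entry["kind"] = field   # a bare field is the kind letter
        entries.append(entry)
    return entries

# Hand-written sample imitating a tags file (not real tool output).
SAMPLE = (
    '!_TAG_FILE_FORMAT\t2\t/extended format/\n'
    'Point\tpoint.h\t3;"\tc\tnamespace:geo\n'
    'x\tpoint.h\t5;"\tm\tclass:geo::Point\taccess:private\n'
    'scale\tpoint.h\t7;"\tp\tclass:geo::Point\taccess:public\tsignature:(int factor)\n'
)

for entry in parse_tags(SAMPLE):
    print(entry)
```

Note that ctags does not run the preprocessor or instantiate templates, so this covers only the declarations as they are written in the source.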

Eugen Constantin Dinca
A: 

Doxygen can also produce detailed XML output if you enable the GENERATE_XML option in the configuration file. It is quite thorough, and very easy to use. From the Doxygen home page:

The XML output consists of a structured "dump" of the information gathered by doxygen. Each compound (class/namespace/file/...) has its own XML file and there is also an index file called index.xml.

An XSLT script called combine.xslt is also generated and can be used to combine all XML files into a single file.

Doxygen also generates two XML schema files, index.xsd (for the index file) and compound.xsd (for the compound files). These schema files describe the possible elements, their attributes and how they are structured, i.e. they describe the grammar of the XML files and can be used for validation or to steer XSLT scripts.

In the addon/doxmlparser directory you can find a parser library for reading the XML output produced by doxygen in an incremental way (see addon/doxmlparser/include/doxmlintf.h for the interface of the library)
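
If you would rather read the compound files directly instead of going through doxmlparser, a short Python sketch might look like the following. The sample is hand-written to imitate a trimmed-down compound file; the element names used (`compounddef`, `compoundname`, `memberdef`, `param`, `location`) should be checked against the generated compound.xsd, and `summarize_doxygen` is a made-up name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating a Doxygen compound file (not real tool output).
SAMPLE = """\
<doxygen>
  <compounddef id="classgeo_1_1Point" kind="class" prot="public">
    <compoundname>geo::Point</compoundname>
    <sectiondef kind="private-attrib">
      <memberdef kind="variable" id="a1" prot="private" static="no" mutable="no">
        <type>int</type>
        <name>x</name>
        <location file="point.h" line="5" column="9"/>
      </memberdef>
    </sectiondef>
    <sectiondef kind="public-func">
      <memberdef kind="function" id="a2" prot="public" static="no">
        <type>void</type>
        <name>scale</name>
        <argsstring>(int factor)</argsstring>
        <param><type>int</type><declname>factor</declname></param>
        <location file="point.h" line="7" column="10"/>
      </memberdef>
    </sectiondef>
    <location file="point.h" line="3" column="1"/>
  </compounddef>
</doxygen>
"""

def text(el):
    # <type> may contain nested <ref> elements, so join all text pieces.
    return "".join(el.itertext()).strip() if el is not None else ""

def summarize_doxygen(xml_text):
    root = ET.fromstring(xml_text)
    result = []
    for comp in root.iter("compounddef"):
        if comp.get("kind") not in ("class", "struct", "union"):
            continue
        namespace, _, name = text(comp.find("compoundname")).rpartition("::")
        loc = comp.find("location")  # direct child: where the class is defined
        info = {"name": name, "namespace": namespace,
                "file": loc.get("file"), "line": int(loc.get("line")),
                "fields": [], "methods": []}
        for member in comp.iter("memberdef"):
            if member.get("kind") == "variable":
                info["fields"].append((text(member.find("type")),
                                       text(member.find("name")),
                                       member.get("prot"),
                                       member.get("mutable") == "yes"))
            elif member.get("kind") == "function":
                params = [(text(p.find("type")), text(p.find("declname")))
                          for p in member.findall("param")]
                info["methods"].append((text(member.find("type")),
                                        text(member.find("name")), params))
        result.append(info)
    return result

print(summarize_doxygen(SAMPLE))
```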

cdiggins
+2  A: 

The DMS Software Reengineering Toolkit is general-purpose program analysis and transformation machinery. Its C++ Front End builds on DMS to provide full-featured C++ parsing for a variety of common C++ dialects, can process a set of C++ classes simultaneously, and constructs full name/type/access information that you can use any way you want. Information is tagged with its precise origin file/line/column. (It includes a full preprocessor.)

You are right; regex can't even come close to this.

Ira Baxter
Correct me if I'm wrong: a half-baked solution will not be useful; either one parses the code in full, or wrong or missing results are to be expected from whatever extraction process the code is submitted to.
MaD70
The meaning of code is pretty fragile, and depends crucially on the meaning of the user symbols. Minor errors in interpreting this meaning usually ripple into results a few operators downstream that are nonsense. If you don't parse C++ in pretty excruciating detail, you can't really build any interesting analyzers, let alone tools that can change code reliably.
Ira Baxter
Thanks for sharing with us your experience, Ira.
MaD70
A: 

See also Ira Baxter here, where he cites his own product.

Warning: mind you, only Elsa "..I hear does a fairly good job.." at constructing a symbol table, which according to Ira Baxter is necessary for OP's original intent (see comments to this answer - I quote him because he is an expert in the field).

MaD70
Continuing the commentary, note that the OP wanted a simple way to extract some type information and generate some stuff from that. Handling that with complete C++ parsers is way too time-consuming and unnecessary, especially as the cost for parser -> xml -> c++ will not be paid at runtime. Apart from that, nice list.
Georg Fritzsche
You underestimate how computationally costly parsing XML is. As Ira Baxter noted (he is an expert in the field) "*You are right; regex can't even come close to this*" and he means (Ira, correct me if I'm wrong) that a half-baked solution will not be useful. Parsing C++ is notoriously hard, and without fully parsing it I expect wrong or missing results in whatever extraction process you submit the code to.
MaD70
You not only have to parse, but you need to build up the symbol table. And this is a bitch; the rules for this occupy the bulk of the 600 page reference manual. ANTLR-based C++ parsers, OpenC++, Stratego, don't do this. Willink's thesis is mildly interesting but I don't know of anybody that used its results in anger. Elsa I hear does a fairly good job. I think Clang says their C++ parser is incomplete at this point. GCC-XML does a good job if all you want is type data and you don't mind the tons of XML that it produces. DMS does this and produces function body information, too.
Ira Baxter
".. 600 page reference manual" obscene! I was not aware of such difficulties in constructing the symbol table of a C++ program. Thanks for the information. Introducing this accidental complexity in a programming language is folly in my (not so humble) opinion. I know that Clang C++ parser is incomplete, but they seem to progress at a fast pace, so I included it for future reference. Of course I mind the tons of XML that it produces (XML is a problem, not a solution). That's my motivation in suggesting other starting points than GCC-XML.
MaD70
Anyway I'm introducing a warning in my original answer about not all of my suggestions being complete for the OP's intent.
MaD70
MaD70, we might have a misunderstanding here. For the points the OP mentioned, gccxml should output sufficient information, and it can be put to use in a matter of minutes, which I doubt will be possible with full C++ front ends. GCC to XML to custom app might not be the cheapest thing to do, but if it's only meant to assist in a build environment or with some simple code generation, the computational cost doesn't matter compared to the development time for the perfect tool for the specific job.
Georg Fritzsche
It's not a misunderstanding, it was simply **my ignorance**: I was not aware that constructing a symbol table for a C++ program is also so difficult. I thought that the greatest difficulties were in parsing. That GCC-XML does this well is a misfortune: XML (and the SGML descendants) is an abomination, and GCC-XML is another **useful** tool that **forces** you to use it.
MaD70
@MaD70: XML isn't *that* bad -- I mean, it's not nearly as clean as JSON, but it sure beats e.g. the old binary Word formats...
SamB