views:

125

answers:

3

I need to parse and split C and C++ functions into the main components (return type, function name/class and method, parameters, etc).

I'm working from either headers or a list where the signatures take the form:

public: void __thiscall myClass::method(int, class myOtherClass * )

I have the following regex, which works for most functions:

(?<expo>public\:|protected\:|private\:) (?<ret>(const )*(void|int|unsigned int|long|unsigned long|float|double|(class .*)|(enum .*))) (?<decl>__thiscall|__cdecl|__stdcall|__fastcall|__clrcall) (?<ns>.*)\:\:(?<class>(.*)((<.*>)*))\:\:(?<method>(.*)((<.*>)*))\((?<params>((.*(<.*>)?)(,)?)*)\)

There are a few functions that it doesn't like to parse, but appear to match the pattern. I'm not worried about matching functions that aren't members of a class at the moment (can handle that later). The expression is used in a C# program, so the <label>s are for easily retrieving the groups.

I'm wondering if there is a standard regex to parse all functions, or how to improve mine to handle the odd exceptions?

+1  A: 

No. Even function prototypes can have arbitrary levels of nesting, so cannot be expressed with a single regular expression.

If you really are restricting yourself to things very close to your example (exactly 2 arguments, etc.), then could you provide an example of something that doesn't match?

Oli Charlesworth
The .NET regex flavor is the only flavor so far that *can* have arbitrary levels of nesting. But using it for this type of job, I wouldn't do it.
Abel
@Abel: The .Net regex "flavor" is recursive and does backtracking (thus it is not a DFA). This was copied from Perl, which exists since many and many years before .Net was invented.
Juliano
@Juliano, not sure what your point is. Indeed, both Perl and .NET are NFA, as are most regex flavors. .NET never made a secret that it follows Perl-syntax closely. But .NET introduced [balancing groups](http://msdn.microsoft.com/en-us/library/bs2twtah.aspx#balancing_group_definition) which is what I meant with *arbitrary levels of nesting*. It keeps track of the number of opened vs closed parentheses (or anything) and only succeeds the match if the opening and closing pairs are of the same amount and paired. So far, afaik, .NET is the only flavor that supports this.
Abel
+3  A: 

C++ is notoriously hard to parse; it is impossible to write a regex that catches all cases. For example, there can be an unlimited number of nested parentheses, which shows that even this subset of the C++ language is not regular.

But it seems that you're going for practicality, not theoretical correctness. Just keep improving your regex until it catches the cases it needs to catch, and try to make it as stringent as possible so you don't get any false matches.

Without knowing the "odd exceptions" that it doesn't catch, it's hard to say how to improve the regex.

Thomas
A: 

Take a look at Boost.Spirit, it is a boost library that allows the implementation of recursive descent parsers using only C++ code and no preprocessors. You have to specify a BNF Grammar, and then pass a string for it to parse. You can even generate an Abstract-Syntax Tree (AST), which is useful to process the parsed data.

The BNF specification looks like for a list of integers or words separated might look like :

using spirit::alpha_p;
using spirit::digit_p;
using spirit::anychar_p;
using spirit::end_p;
using spirit::space_p;

// Inside the definition...
integer    = +digit_p;                      // One or more digits.
word       = +alpha_p;                      // One or more letters.
token      = integer | word;                // An integer or a word.
token_list = token >> *(+space_p >> token)  // A token, followed by 0 or more tokens.

For more information refer to the documentation, the library is a bit complex at the beginning, but then it gets easier to use (and more powerful).

jbernadas