Start by scanning scopes. You need to count open braces { and close braces } as you work your way through the file, so that you know which scope you are in. You also need to parse // and /* ... */ as you scan the file, so you can tell when something is in a comment rather than being real code. There's also #if, but you would have to compile the code to know how to interpret these.
Then you need to parse the text immediately prior to some scope open braces to work out what they are. Your functions may be in global scope, class scope, or namespace scope, so you have to be able to parse namespaces and classes to identify the type of scope you are looking at. You can usually get away with fairly simple parsing (most programmers use a similar style - for example, it's uncommon for someone to put blank lines between the 'class Fred' and its open brace. But they might write 'class Fred {'. There is also the chance that they will put extra junk on the line - e.g. 'template class __DECLSPEC MYWEIRDMACRO Fred {'. However, you can get away with a pretty simple "does the line contain the word 'class' with whitespace on both sides? heuristic that will work in most cases.
OK, so you now know that you are inside a namepace, and inside a class, and you find a new open scope. Is it a method?
The main identifying features of a method are:
- return type. This could be any sequence of characters and can be many tokens ("__DLLEXPORT const unsigned myInt32typedef * &"). Unless you compile the entire project you have no chance.
- function name. A single token (but watch out for "operator =" etc)
- an pair of brackets containing zero or more parameters or a 'void'. This is your best clue.
- A function declaration will not include certain reserved words that will precede many scopes (e.g. enum, class, struct, etc). And it may use some reserved words (template, const etc) that you must not trip over.
So you could search up for a blank line, or a line ending in ; { or } that indicates the end of the previous statement/scope. Then grab all the text between that point and the open brace of your scope. Then extract a list of tokens, and try to match the parameter-list brackets. Check that none of the tokens are reserved words (enum, struct, class etc).
This will give you a "reasonable degree of confidence" that you have a method. You don't need much parsing to get a pretty high degree of accuracy. You could spend a lot of time finding all the special cases that confuse your "parser", but if you are working on a reasonably consistent code-base (i.e. just your own company's code) then you'll probably be able to identify all the methods in the code fairly easily.