I have a program, written in C#, that takes a C++ or C# file and counts its total lines, the lines in comments, and the lines in designer-generated code blocks. I want to add the ability to count how many functions are in the file and how many lines those functions contain. I can't quite figure out how to determine whether a line (or series of lines) is the start of a function (or method).

At the very least, a function declaration is a return type followed by an identifier and an argument list. Is there a way to determine in C# that a token is a valid return type? If not, is there any way to easily determine whether a line of code is the start of a function? Basically, I need to be able to reliably distinguish something like

bool isThere() 
{
...
}

from

bool isHere = isThere()

and from

isThere()

as well as from any other function declaration lookalikes.

A: 

Is there a way to determine in C# that a token is a valid return type?

You can determine that it's either a return type or an error pretty easily (by making sure it's not anything else that could be in that position). And you probably don't need to guarantee "correct" behaviour on invalid code.

Then you look for the parentheses.

Anon.
Why the downvote?
Anon.
It wasn't me, but probably because "not anything else in that position" covers several thousand different things, and in many cases you will have to parse quite a bit of the surrounding code to work it out. How can you tell if "MyThing" is a return type? e.g. It could be a macro that expands to "class Thing {". How can you tell if the return type you have found is not in a comment?
Jason Williams
+2  A: 

The problem is that to do this accurately, you must take into account all of the possible ways a C# function can be defined. In essence, you need to write a parser. Doing so is beyond the scope of a simple SO answer.

There will likely be a lot of answers to this question in the form of regexes. They will work for common cases but will likely blow up in corner cases like the following:

int
?
/* this 
is */
main /* legal */ (code c) { 
}
JaredPar
Bear in mind that C++ is hard to parse, although I think C# is a lot better.
David Thornley
@David, C# is hard to parse, C++ is nearly impossible :)
JaredPar
+1  A: 

I'd probably use a regular expression, though given the number of datatypes, declaration options, and user-defined types/classes, it would be non-trivial. To simply avoid capturing assignments from function calls, you might start with a Regex (untested) like:

((private|public|internal|protected|virtual)\s+)?(static\s+)?(int|bool|string|byte|char|double|long)\s+([A-Za-z_][A-Za-z_0-9]*)\s*\(

This doesn't (by a long shot) catch everything, and you'd need to tune it up.
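
As a rough illustration of how far such a pattern gets you, here is a minimal C++ sketch. It uses a pattern in the spirit of the one above, with each optional modifier carrying its own trailing whitespace and an anchor at the start of the line. The helper name `looksLikeDeclaration` is made up for this example, and the list of types is just the handful of built-ins from the pattern, so user-defined return types are missed by design.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `line` looks like the start of a method
// declaration whose return type is one of a few hard-coded built-in types.
bool looksLikeDeclaration(const std::string& line) {
    static const std::regex pattern(
        R"(^\s*((private|public|internal|protected|virtual)\s+)?(static\s+)?)"
        R"((int|bool|string|byte|char|double|long)\s+([A-Za-z_][A-Za-z_0-9]*)\s*\()");
    // regex_search plus the leading ^ means "match at the start of the line".
    return std::regex_search(line, pattern);
}
```

With this sketch, `bool isThere()` is accepted, while `bool isHere = isThere()` and a bare `isThere()` are rejected, because the opening parenthesis has to follow the declared name directly. A declaration such as `MyType Build()` is also rejected, which shows the main gap: the pattern knows nothing about user-defined types.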

Another approach could involve reflection to determine function declarations, but that's probably not appropriate when you want to do static source code analysis.

theraccoonbear
+1  A: 

Start by scanning scopes. You need to count open braces { and close braces } as you work your way through the file, so that you know which scope you are in. You also need to parse // and /* ... */ as you scan the file, so you can tell when something is in a comment rather than being real code. There's also #if, but you would have to compile the code to know how to interpret these.
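
A minimal C++ sketch of that scan, under some stated assumptions: the function name `openBraceDepths` is invented for this example, string and character literals are skipped as well (not mentioned above, but a brace inside a literal would otherwise be miscounted), and `#if` blocks, C# verbatim strings, and C++ raw strings are not handled at all.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: for every real '{' in `src`, records the nesting depth
// at which it opens (0 = file scope). Braces inside comments and inside
// ordinary string/char literals are ignored.
std::vector<int> openBraceDepths(const std::string& src) {
    enum class State { Code, LineComment, BlockComment, String, Char };
    State state = State::Code;
    std::vector<int> depths;
    int depth = 0;

    for (std::size_t i = 0; i < src.size(); ++i) {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';

        switch (state) {
        case State::Code:
            if (c == '/' && next == '/') { state = State::LineComment; ++i; }
            else if (c == '/' && next == '*') { state = State::BlockComment; ++i; }
            else if (c == '"') { state = State::String; }
            else if (c == '\'') { state = State::Char; }
            else if (c == '{') { depths.push_back(depth); ++depth; }
            else if (c == '}' && depth > 0) { --depth; }
            break;
        case State::LineComment:
            if (c == '\n') state = State::Code;
            break;
        case State::BlockComment:
            if (c == '*' && next == '/') { state = State::Code; ++i; }
            break;
        case State::String:
            if (c == '\\') ++i;                    // skip the escaped character
            else if (c == '"') state = State::Code;
            break;
        case State::Char:
            if (c == '\\') ++i;                    // skip the escaped character
            else if (c == '\'') state = State::Code;
            break;
        }
    }
    return depths;
}
```

For a file with a namespace, a class inside it, and one method inside the class, the result is 0, 1, 2: the method body is the brace that opens at depth 2. Recording positions alongside the depths would be a natural next step for counting the lines in each body.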

Then you need to parse the text immediately prior to each scope's open brace to work out what the scope is. Your functions may be in global scope, class scope, or namespace scope, so you have to be able to parse namespaces and classes to identify the type of scope you are looking at. You can usually get away with fairly simple parsing, because most programmers use a similar style: for example, it's uncommon for someone to put blank lines between 'class Fred' and its open brace, but they might write 'class Fred {'. There is also the chance that they will put extra junk on the line, e.g. 'template class __DECLSPEC MYWEIRDMACRO Fred {'. However, a pretty simple "does the line contain the word 'class' with whitespace on both sides?" heuristic will work in most cases.

OK, so you now know that you are inside a namespace, and inside a class, and you find a new open scope. Is it a method?

The main identifying features of a method are:

  • A return type. This could be any sequence of characters and can be many tokens ("__DLLEXPORT const unsigned myInt32typedef * &"). Unless you compile the entire project you have no chance of validating it.
  • A function name. A single token (but watch out for "operator =" etc).
  • A pair of brackets containing zero or more parameters or a 'void'. This is your best clue.
  • The absence of certain reserved words that precede many other scopes (e.g. enum, class, struct). A function declaration may still use some reserved words (template, const etc.) that you must not trip over.

So you could search upwards for a blank line, or a line ending in ;, { or }, that indicates the end of the previous statement/scope. Then grab all the text between that point and the open brace of your scope. Then extract a list of tokens, and try to match the parameter-list brackets. Check that none of the tokens are reserved words (enum, struct, class etc).

This will give you a "reasonable degree of confidence" that you have a method. You don't need much parsing to get a pretty high degree of accuracy. You could spend a lot of time finding all the special cases that confuse your "parser", but if you are working on a reasonably consistent code-base (i.e. just your own company's code) then you'll probably be able to identify all the methods in the code fairly easily.
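
The header check described above can be sketched in C++ roughly as follows. All three function names are invented for this example. The sketch assumes comments have already been removed, stops at the nearest `;`, `{` or `}` rather than also looking for blank lines, and adds two rules of its own: control-flow keywords such as `if` and `while` disqualify a header, and so does an `=` outside the parameter list (which also rejects `operator =`).

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Text between the end of the previous statement/scope and the '{' at bracePos.
std::string headerBefore(const std::string& src, std::size_t bracePos) {
    std::size_t start = bracePos;
    while (start > 0) {
        const char c = src[start - 1];
        if (c == ';' || c == '{' || c == '}') break;
        --start;
    }
    return src.substr(start, bracePos - start);
}

// Splits text into identifier-like words and single punctuation characters.
std::vector<std::string> tokenize(const std::string& text) {
    auto isWordChar = [](char ch) {
        return std::isalnum(static_cast<unsigned char>(ch)) || ch == '_';
    };
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        if (std::isspace(static_cast<unsigned char>(text[i]))) { ++i; continue; }
        std::size_t j = i + 1;
        if (isWordChar(text[i])) {
            while (j < text.size() && isWordChar(text[j])) ++j;
        }
        tokens.push_back(text.substr(i, j - i));
        i = j;
    }
    return tokens;
}

// Heuristic only: no disqualifying keyword, balanced brackets, and a '(' at
// the outer level that directly follows a name.
bool looksLikeMethodHeader(const std::string& header) {
    static const std::set<std::string> disqualifying = {
        "enum", "struct", "class", "namespace", "interface", "if", "else",
        "for", "foreach", "while", "do", "switch", "try", "catch", "return", "new"};
    const std::vector<std::string> tokens = tokenize(header);
    int depth = 0;
    bool sawParameterList = false;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        const std::string& t = tokens[i];
        if (disqualifying.count(t)) return false;
        if (t == "=" && depth == 0) return false;   // assignment or initializer
        if (t == "(") {
            if (depth == 0) {
                if (i == 0) return false;
                const unsigned char prev = static_cast<unsigned char>(tokens[i - 1][0]);
                if (!(std::isalpha(prev) || prev == '_')) return false;
                sawParameterList = true;
            }
            ++depth;
        } else if (t == ")") {
            if (--depth < 0) return false;
        }
    }
    return sawParameterList && depth == 0;
}
```

Called on each open brace found while scanning, `looksLikeMethodHeader(headerBefore(src, bracePos))` accepts a header like `bool isThere()` and rejects `class C`, `if (x > 0)` and `bool isHere = isThere()`. It does not check the return type at all, so constructors pass, and macros that expand to something else will still fool it.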

Jason Williams
A: 

If you want to write a real parser (I know you might not want to), then try ANTLR. If nothing else, it will be a fun project.

pm100