I have a program which looks in source code, locates methods, and performs some calculations on the code inside of each method. I am trying to use regular expressions to do this, but this is my first time using them in C# and I am having difficulty testing the results.

If I use this regular expression to find the method signature:

((private)|(public)|(sealed)|(protected)|(virtual)|(internal))+([a-z]|[A-Z]|[0-9]|[\s])*([\()([a-z]|[A-Z]|[0-9]|[\s])*([\)|\{]+)

and then split the source code by this method, storing the results in an array of strings:

string[] MethodSignatureCollection = regularExpression.Split(SourceAsString);

would this get me what I want, i.e. a list of methods including the code inside each of them?

+7  A: 

I would strongly suggest using reflection (if it is appropriate) or CSharpCodeProvider.Parse(...) (as recommended by rstevens).

It can be very difficult to write a regular expression that works in all cases.

Here are some cases you'd have to handle:

public /* comment */ void Foo(...)      // Comments can be everywhere
string foo = "public void Foo(...){}";  // Don't match signatures in strings 
private __fooClass _Foo()               // Underscores are ugly, but legal
private void @while()                   // Identifier escaping
public override void Foo(...)           // Have to recognize overrides
void Foo();                             // Defaults to private
void IDisposable.Dispose()              // Explicit implementation

public // More comments                 // Signatures can span lines
    void Foo(...)

private void                            // Attributes
   Foo([Description("Foo")] string foo) 

#if(DEBUG)                              // Don't forget the pre-processor
    private
#else
    public
#endif
    int Foo() { }

Notes:

  • The Split approach does not return the text it matches. At best, .NET interleaves the text of individual capture groups into the result, so you will not get the "signatures" you are splitting on back intact.
  • Don't forget that signatures can have commas in them.
  • {...} can be nested, so your current regexp could consume more { than it should.
  • There is a lot of other stuff (preprocessor commands, using statements, properties, comments, enum definitions, attributes) that can show up in code, so the fact that something sits between two method signatures does not make it part of a method body.
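
As a rough illustration of the first and third notes, here is a minimal sketch that uses Regex.Matches instead of Split, so the matched signature text is kept, and then counts braces to find the end of each body. The pattern is a deliberately simplified one written for this example, and the names MethodScanner and FindMethods are hypothetical. It ignores comments, string literals, attributes, constructors, and preprocessor directives, so it will still fail on most of the cases listed above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MethodScanner
{
    // Simplified, hypothetical pattern: one or more modifiers, a return type,
    // a name, a parameter list without nested parentheses, then an opening brace.
    private static readonly Regex Signature = new Regex(
        @"(?:(?:public|private|protected|internal|static|virtual|override|sealed)\s+)+" +
        @"[\w<>\[\],.]+\s+\w+\s*\([^()]*\)\s*\{");

    // Returns (signature, body) pairs; the body excludes the outer braces.
    public static List<KeyValuePair<string, string>> FindMethods(string source)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Signature.Matches(source))
        {
            int open = m.Index + m.Length - 1;   // position of the opening '{'
            int depth = 0;
            for (int i = open; i < source.Length; i++)
            {
                if (source[i] == '{')
                {
                    depth++;
                }
                else if (source[i] == '}' && --depth == 0)
                {
                    // Matching close brace found: everything in between is the body.
                    string signature = source.Substring(m.Index, open - m.Index).Trim();
                    string body = source.Substring(open + 1, i - open - 1);
                    result.Add(new KeyValuePair<string, string>(signature, body));
                    break;
                }
            }
        }
        return result;
    }
}
```

The brace counting is done in ordinary code rather than in the pattern itself, which keeps the regular expression small. A brace inside a string literal or comment will still throw the count off, which is one more reason to prefer a real parser.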
Daniel LeCheminant
A: 

No, those access modifiers can also be used for internal classes and fields, among other things. You'd need to write a full C# parser to get it right.

You can do what you want using reflection. Try something like the following:

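  // GetMethods() with no arguments returns only the public methods of Foo
  var methods = typeof(Foo).GetMethods();
  foreach (var info in methods)
  {
    // MethodBody exposes the compiled IL and locals; it is null for abstract methods
    var body = info.GetMethodBody();
  }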

That probably has what you need for your calculations.
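
To make the reflection route a little more concrete, here is a minimal sketch that records the IL size of each method declared on a type. Foo is a stand-in class invented for this example, and BodyInspector and IlSizes are hypothetical names. Keep in mind that GetMethodBody() exposes compiled IL and local variable metadata, not C# source text.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical class to inspect; substitute your own type.
public class Foo
{
    public int Add(int a, int b) { int sum = a + b; return sum; }
    private void DoNothing() { }
}

public static class BodyInspector
{
    // Maps each method declared on the type to the size of its IL in bytes.
    // Overloads share a name, so a later overload overwrites an earlier one here.
    public static Dictionary<string, int> IlSizes(Type type)
    {
        const BindingFlags flags = BindingFlags.Public | BindingFlags.NonPublic |
                                   BindingFlags.Instance | BindingFlags.Static |
                                   BindingFlags.DeclaredOnly;
        var sizes = new Dictionary<string, int>();
        foreach (MethodInfo info in type.GetMethods(flags))
        {
            MethodBody body = info.GetMethodBody();
            if (body == null) continue;          // abstract or extern: no IL
            sizes[info.Name] = body.GetILAsByteArray().Length;
        }
        return sizes;
    }
}
```

Calling BodyInspector.IlSizes(typeof(Foo)) yields one entry for Add and one for DoNothing. The DeclaredOnly flag leaves out inherited members such as ToString.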

If you need the original C# source code you can't get it with reflection. Don't write your own parser. Use an existing one, listed here.

RossFabricant
A: 

It is feasible, I guess, to get something working using regexes. However, this requires looking very carefully at the C# language specification and a deep understanding of the C# grammar; it is not a simple problem. I know you've said you want to store the methods as arrays of strings, but presumably there is something beyond that. Others have already suggested reflection; if that does not do what you want, you should consider ANTLR (ANother Tool for Language Recognition). C# grammars are available for ANTLR.

http://www.antlr.org/about.html

wentbackward
Actually, ordinary regexes can't solve this problem because they can't count. (Perl "regexes" are Turing complete, though.)
RossFabricant
+3  A: 

Maybe a better approach is to use CSharpCodeProvider.Parse(), which can "compile" C# source code into a CodeCompileUnit. You can then walk through the namespaces, types, classes and methods in that compile unit.

rstevens
+1 for CSharpCodeProvider
Daniel LeCheminant