Hi, could somebody help me parse the following from a C# method declaration: scope, isStatic, name, return type, and the list of parameters with their types? Given a method declaration like this

public static SomeReturnType GetSomething(string param1, int param2)

etc. I need to be able to parse it and get the info above. So in this case

  • name = "GetSomething"
  • scope = "public"
  • isStatic = true
  • returnType = "SomeReturnType"

and then an array of parameter type and name pairs.

Oh, I almost forgot the most important part: it has to account for all the other scopes (protected, private, internal, protected internal), the absence of "static", a void return type, etc.

Please note that REFLECTION is not a solution here. I need REGEX.

So far I have these two:

 (?:(?:public)|(?:private)|(?:protected)|(?:internal)|(?:protected internal)\s+)*

(?:(?:static)\s+)*

I guess for the rest of the problem I can just get away with string manipulation without regex.

A: 

Well, given the rules you've provided, it would probably be best to use a series of regular expressions rather than trying to come up with a single expression. That expression would be enormous.

If you're sold on a single expression, you'll need a regular expression that uses grouping, look-ahead and look-behind.

http://www.regular-expressions.info/lookaround.html

Even with the limited scope of what you're trying to parse out of it, you'll still need some very specific guidelines on all possibilities.

Joel Etherton
I am fine with multiple regexes.
epitka
A: 

I wouldn't bother with using Regex. When you get to the part of interpreting method parameters, it gets really messy (ref and out keywords for example). I don't know if you need support for attribute notation as well, but that would make it a complete mess.

Maybe a C# parser library can be of help. I've found a few on the internet:

Alternatively, you could first feed the code to the compiler at runtime, and then use reflection on the newly created assembly. It will be slower, but pretty much guaranteed to be correct. Even though you seem to be opposed to the idea of using reflection, this can be a viable solution.

Something like this:

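// Requires: using System; using System.CodeDom.Compiler;
// using System.Collections.Generic; using System.Linq; using Microsoft.CSharp;
List<string> referenceAssemblies = new List<string>()
{
    "System.dll"
    // ...
};

// A non-abstract method needs a body to compile, so append one that just throws.
// It is never executed; it only lets the signature compile for any return type.
string source = "public class TestClass { " + input +
    " { throw new System.NotImplementedException(); } }";

CSharpCodeProvider codeProvider = new CSharpCodeProvider();

// No assembly name specified
CompilerParameters compilerParameters =
    new CompilerParameters(referenceAssemblies.ToArray());
compilerParameters.GenerateExecutable = false;
compilerParameters.GenerateInMemory = false;

CompilerResults compilerResults = codeProvider.CompileAssemblyFromSource(
    compilerParameters, source);

// Stop here if the signature did not compile
if (compilerResults.Errors.HasErrors)
    throw new InvalidOperationException("Compilation failed");

Type testClass = compilerResults.CompiledAssembly.GetTypes().First();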

Then use reflection on testClass.
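A minimal sketch of that reflection step, assuming the compiled type declares exactly one method. `SignatureReader`, `Describe`, `FirstDeclaredMethod` and `Demo` are hypothetical names used only for illustration; the reflection members themselves (`GetMethods`, `GetParameters`, `IsPublic`, `IsFamily`, and so on) are standard `System.Reflection`.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for the compiled TestClass.
public class Demo
{
    public static int GetSomething(string param1, int param2) { return 0; }
}

public static class SignatureReader
{
    // Returns "scope|isStatic|returnType|name|parameters" for one method.
    public static string Describe(MethodInfo m)
    {
        string scope =
            m.IsPublic ? "public" :
            m.IsPrivate ? "private" :
            m.IsFamily ? "protected" :
            m.IsAssembly ? "internal" :
            m.IsFamilyOrAssembly ? "protected internal" : "unknown";

        // Type names come back as CLR names (Int32, String), not C# keywords.
        string parms = string.Join(", ", m.GetParameters()
            .Select(p => p.ParameterType.Name + " " + p.Name));

        return scope + "|" + m.IsStatic + "|" + m.ReturnType.Name + "|" + m.Name + "|" + parms;
    }

    // DeclaredOnly leaves out the methods inherited from System.Object.
    public static MethodInfo FirstDeclaredMethod(Type type)
    {
        const BindingFlags flags = BindingFlags.Public | BindingFlags.NonPublic |
            BindingFlags.Static | BindingFlags.Instance | BindingFlags.DeclaredOnly;
        return type.GetMethods(flags).First();
    }
}
```

With the compiled type in hand, the call would look like `SignatureReader.Describe(SignatureReader.FirstDeclaredMethod(testClass))`.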

Compiling should be safe without input validation, because you're not executing any of the code. You'd only need very basic checks, such as making sure only 1 method signature is entered.

Thorarin
'ref' and 'out' are not supported either.
epitka
+5  A: 

Some thoughts on your problem:

A set of strings that can all be matched by a particular regular expression is called a regular language. The set of strings which are legal method declarations is not a regular language in any version of C#. If you are attempting to find a regular expression which matches every legal C# method declaration and rejects every illegal C# method declaration then you are out of luck.

More generally, regular expressions are almost always a bad idea for anything but the simplest matching problems. (Sorry Jeff.) A far better approach is to first write a lexer, which breaks up the string into a sequence of tokens. Then analyze the token sequence. (Using regular expressions as part of a lexer is not a terrible idea, though you can get by without them.)
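A minimal sketch of the lexer-first idea, assuming a deliberately tiny token set: identifiers or keywords, plus the punctuation `(`, `)` and `,`. `TinyLexer` and `Tokenize` are hypothetical names, and anything outside that token set (generics, arrays, attributes, comments) is rejected rather than handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TinyLexer
{
    // \G anchors each match at the position the search starts from,
    // so no input can be silently skipped between tokens.
    private static readonly Regex Token = new Regex(
        @"\G\s*(?:(?<word>[A-Za-z_]\w*)|(?<punct>[(),]))");

    public static List<string> Tokenize(string input)
    {
        var tokens = new List<string>();
        string text = input.TrimEnd();
        int pos = 0;
        while (pos < text.Length)
        {
            Match m = Token.Match(text, pos);
            if (!m.Success)
                throw new FormatException("Unexpected input near position " + pos);

            tokens.Add(m.Groups["word"].Success
                ? m.Groups["word"].Value
                : m.Groups["punct"].Value);
            pos = m.Index + m.Length;
        }
        return tokens;
    }
}
```

The token list for the question's example is `public`, `static`, `SomeReturnType`, `GetSomething`, `(`, `string`, `param1`, `,`, `int`, `param2`, `)`. Analyzing it is then a matter of walking the list in order instead of writing one large pattern.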

I note also that you are glossing over rather a lot of complications in parsing method declarations. You did not mention:

  • generic/array/pointer/nullable return and formal parameter types
  • generic type parameter declarations
  • generic type parameter constraints
  • unsafe/extern/new/override/virtual/abstract/sealed methods
  • explicit interface implementation methods
  • method/parameter/return attributes
  • partial methods -- slightly tricky to parse, partial is a contextual keyword
  • comments

I also note that you've not said whether you are guaranteed that the method signature is already good, or if you need to identify bad ones and produce diagnostics as to why they're bad. That's a much harder problem.

Why do you want to do this in the first place? Doing this correctly is rather a lot of work. Perhaps there is an easier way to get what you want?

Eric Lippert
+1 for the lexer :)
GalacticCowboy
Thanks Eric for such a detailed post on the problems that I might run into, but none of the complications you mention need to be handled at all, as they are just invalid in this scenario (I clarified this in the comments area, my bad). I don't need to guarantee that the method signature is good at all, as it will fail at compile time. I think that with these two regexes above and a little string splitting I'll be OK.
epitka
As for an easier way, I don't think there is one, other than creating a dummy class, compiling it in a separate app domain and then doing reflection on it, but that is just overkill for what I need here. And all this needs to work in a web environment in medium trust.
epitka
So what you're saying, @epitka, is that you are defining your own subset of C# and writing a recognizer for that. Since neither I nor anyone else reading this knows what the specification of your new language is, it'll be hard for us to help you write a recognizer for it. My advice is that you start by writing a *grammar* of the language you intend to recognize, prove that it is a subset of the C# language, and then write a recognizer of your new grammar.
Eric Lippert
@epitka: If those inputs are *invalid* in your scenario then you have a choice to make. Are you going to (1) detect the invalid inputs and give an error, or (2) allow user-supplied data that you know to be invalid to get into your system? The latter seems dangerous; I prefer to have a recognizer in place that recognizes and rejects invalid inputs rather than trucking along and hoping for the best. If you choose the safer option then you'll need to write a recognizer that can separate valid from invalid inputs; again, you need to write a grammar for the set of valid inputs.
Eric Lippert
I would let the invalid input in. Here is a little background: I am building a metadata-driven modeling tool that will basically generate everything for the user based on the meta model (UI, domain, DAO, NH mapping, etc.). One of the targets is C#, and for that the user has to obey some simple rules if he wants his code to compile successfully at this stage. There are so many other things that I need to do that writing a recognizer would just set me back too much, so I am OK with invalid stuff coming in, as I can just ignore it or let the C# compiler bark at it later on.
epitka
A: 
// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
string test = @"public static SomeReturnType GetSomething(string param1, int param2)";
// [^)]* (rather than [^)]+) also accepts an empty parameter list.
var match = Regex.Match(test, @"(?<scope>\w+)\s+(?<static>static\s+)?(?<return>\w+)\s+(?<name>\w+)\((?<parms>[^)]*)\)");
Console.WriteLine(match.Groups["scope"].Value);
Console.WriteLine(match.Groups["static"].Success);
Console.WriteLine(match.Groups["return"].Value);
Console.WriteLine(match.Groups["name"].Value);
// Split on commas and trim each "type name" pair.
var parms = match.Groups["parms"].Value
    .Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
    .Select(p => p.Trim())
    .ToList();
parms.ForEach(x => Console.WriteLine(x));
Console.Read();

This breaks for parameter types that contain commas (generic types such as Dictionary<string, int>), but it's quite possible to handle that as well.
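A possible extension of the same idea, as a sketch only: it assumes the restricted input described in the question (an optional access modifier including the two-word "protected internal", optional "static", a single-word return type, and plain "type name" parameters). `MethodSignature` and `SignatureRegex` are hypothetical names, and parameters that are not exactly two words (for example with ref or out) are skipped rather than parsed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public sealed class MethodSignature
{
    public string Scope;        // empty when no access modifier is written
    public bool IsStatic;
    public string ReturnType;
    public string Name;
    // Key = parameter type, Value = parameter name
    public List<KeyValuePair<string, string>> Parameters =
        new List<KeyValuePair<string, string>>();
}

public static class SignatureRegex
{
    // "protected\s+internal" is listed first so it wins over plain "protected".
    private static readonly Regex Pattern = new Regex(
        @"^\s*(?:(?<scope>protected\s+internal|public|private|protected|internal)\s+)?" +
        @"(?<static>static\s+)?(?<return>\w+)\s+(?<name>\w+)\s*\((?<parms>[^)]*)\)\s*$");

    // Returns null when the declaration does not fit the assumed shape.
    public static MethodSignature Parse(string declaration)
    {
        Match m = Pattern.Match(declaration);
        if (!m.Success) return null;

        var sig = new MethodSignature
        {
            Scope = Regex.Replace(m.Groups["scope"].Value, @"\s+", " "),
            IsStatic = m.Groups["static"].Success,
            ReturnType = m.Groups["return"].Value,
            Name = m.Groups["name"].Value
        };

        foreach (string raw in m.Groups["parms"].Value.Split(','))
        {
            string[] parts = raw.Split(new[] { ' ', '\t' },
                StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length != 2) continue; // empty list, or not a plain "type name" pair
            sig.Parameters.Add(new KeyValuePair<string, string>(parts[0], parts[1]));
        }
        return sig;
    }
}
```

For the declaration in the question, `SignatureRegex.Parse` yields scope "public", IsStatic true, return type "SomeReturnType", name "GetSomething", and the pairs (string, param1) and (int, param2).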

Paul Creasey