I need to validate the contents of a C# method.

I do not care about syntax errors that do not affect the method's scope.

I do care about characters that will invalidate parsing of the rest of the code. For example:

method()
{
  /* valid comment */
  /*           <-- bad
  for (i..) {
  } 
  for (i..) {  <-- bad
}

I need to validate/fix any non-paired characters.

This includes /* */, { }, and maybe others.

How should I go about this?

My first thought was Regex, but that clearly isn't going to get the job done.

+1  A: 

A regex is certainly not the answer to this problem. Regexes are useful tools for certain types of data validation, but once you get into more complicated data, such as matching braces or comment blocks, a regex no longer gets the job done.

Here is a blog article on the limitations encountered when using a regex to validate input.

In order to do this you will have to write a parser of sorts which does the validation.

JaredPar
A: 

If you're trying to "validate" the contents of a string defining a method, then you may be better off just using the CodeDom classes to compile the method on the fly into an in-memory assembly.

Writing your own fully-functional parser to do validation will be very, very difficult, especially if you want to support C# 3 or later. Lambda expressions and other constructs like that will be very difficult to "validate" cleanly.

Reed Copsey
A: 

You're drawing a false dichotomy between "characters that will invalidate parsing of the rest of the code" and "syntax errors". Lacking a closing curly brace (one of the problems you mention) is a syntax error. It looks like you mean you're looking for syntax errors that potentially break scope boundaries? Unfortunately, there's no robust way to do this short of using a full parser.

As an example:

method()
{ <-- is missing closing brace
  /* valid comment */
  /*           <-- bad
  for (i..) {
  } 
  for (i..) {  
} <-- will be interpreted as the closing brace for the for loop

There's no general, practical way to infer that it's the for loop that's missing its closing brace, rather than the method.

If you're really interested in looking for these sorts of things, you should consider running the compiler programmatically and parsing the results - that's the best approach with the lowest entry threshold.

Dathan
The method's braces are not part of what I'm validating. Only the contents are in the string that I need to validate
jaws
I've thought about this particular issue before, and see no reason why indentation could not be used to determine that the last brace was intended for the method.
Simon Buchan
@user258651 Apologies - I must have misunderstood your code sample, then. What did you mean by the "<-- bad" on the second for loop?
Dathan
@Simon You're probably right, as the goal seems to be an informative tool rather than one that needs strict correctness, and most code is nicely indented these days.
Dathan
@Dathan "for (i..) { <-- bad" is bad because there is no } to match. I'm only interested in validating the contents of the method, everything inside (but not including) the method's curly brackets.
jaws
+1  A: 

A regular expression isn't a very convenient thing for such a task. This is often implemented using a stack with an algorithm like the following:

  1. Create an empty stack S.
  2. While there are characters left:
  3.     Read a character ch.
  4.     If ch is an opening paren (of any kind), push it onto S.
  5.     Else:
  6.         If ch is a closing paren (of any kind), look at the top of S.
  7.         If S is empty at this point, report failure.
  8.         If the top of S is the opening paren that corresponds to ch, then pop S and continue from step 2; this paren matches OK.
  9.         Else report failure.
  10. If at the end of input the stack S is not empty, return failure. Else return success.
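
The steps above can be sketched roughly as follows. This is a minimal illustration in Python (chosen for brevity; the same logic ports directly to C# with a Stack<char>). The names PAIRS, OPENERS and is_balanced are made up for this example, and it deliberately knows nothing about comments or string literals.

```python
# Plain stack-based bracket matching, following the numbered steps.
# Assumption: comments and string literals are NOT treated specially.
PAIRS = {")": "(", "]": "[", "}": "{"}   # closing bracket -> its opener
OPENERS = set(PAIRS.values())


def is_balanced(text):
    stack = []                            # step 1: empty stack
    for ch in text:                       # steps 2-3: read each character
        if ch in OPENERS:                 # step 4: push any opener
            stack.append(ch)
        elif ch in PAIRS:                 # step 6: closer, inspect top of stack
            if not stack or stack[-1] != PAIRS[ch]:
                return False              # steps 7 and 9: empty stack or wrong opener
            stack.pop()                   # step 8: matched, keep scanning
    return not stack                      # step 10: leftovers mean failure


print(is_balanced("for (i) { a[0]; }"))   # balanced input
print(is_balanced("for (i) {"))           # unclosed brace
```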

For more information, see http://www.ccs.neu.edu/home/sbratus/com1101/lab4.html and http://codeidol.com/csharp/csharpckbk2/Data-Structures-and-Algorithms/Determining-Where-Characters-or-Strings-Do-Not-Balance/

olle
Step #4 is incorrect because you have to account for comments. The following code would register as valid but is in fact invalid: foo ( // )
JaredPar
regex helps to ignore things in comments or strings, though - you can match the whole thing in one go, no parser state needed.
Simon Buchan
@JaredPar: A line comment is just a "//" opening matched with a "\n" closing. It is no different to block comments or strings.
Simon Buchan
@Simon the algorithm mentioned makes no attempt to ignore comments so their implementation is incorrect.
JaredPar
@jaredpar you are absolutely right. I do think it contains the general idea; I left adding the edge cases as an exercise for the reader... ;)
olle
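
One way to address the `foo ( // )` objection raised in the comments is to have the scanner skip over comments and literals before looking at brackets. The following is only a sketch under stated assumptions, written in Python for brevity: find_unmatched is a hypothetical name, only //, /* */, "..." and '...' are recognised, backslash is assumed to be the escape character, and C# verbatim strings (@"...") and preprocessor directives are not handled at all.

```python
PAIRS = {")": "(", "]": "[", "}": "{"}   # closing bracket -> its opener


def find_unmatched(text):
    """Return sorted (index, token) pairs for unmatched brackets and
    unterminated block comments, ignoring comments and literals."""
    problems = []
    stack = []                            # (index, opener) still waiting for a closer
    i, n = 0, len(text)
    while i < n:
        ch, two = text[i], text[i:i + 2]
        if two == "//":                   # line comment: skip to end of line
            end = text.find("\n", i)
            i = n if end == -1 else end + 1
        elif two == "/*":                 # block comment: skip to the closing */
            end = text.find("*/", i + 2)
            if end == -1:
                problems.append((i, "/*"))  # never closed: swallows the rest
                i = n
            else:
                i = end + 2
        elif ch in "\"'":                 # string or char literal (assumed single-line)
            j = i + 1
            while j < n and text[j] != ch and text[j] != "\n":
                j += 2 if text[j] == "\\" else 1   # step over escaped characters
            i = j + 1
        elif ch in "([{":
            stack.append((i, ch))
            i += 1
        elif ch in PAIRS:
            if stack and stack[-1][1] == PAIRS[ch]:
                stack.pop()
            else:
                problems.append((i, ch))  # closer with no matching opener
            i += 1
        else:
            i += 1
    problems.extend(stack)                # openers that were never closed
    return sorted(problems)


print(find_unmatched("foo ( // )"))
```

The returned indices are offsets into the input string, which is what a highlighter would need. Note that this reports where a mismatch was detected, not which brace the author meant to close, which is the ambiguity discussed in the other answers.
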
+3  A: 

You'll need to scope your problem more carefully in order to get a sensible answer.

For example, what are you going to do about methods that contain preprocessor directives?

void M()
{

#if FOO
    for(foo;bar;blah) {
#else
    while(abc) {
#endif
        Blah();
    }
}

This is silly but legal, so you have to handle it. Are you going to count that as a mismatched brace or not?

Can you provide a detailed specification of exactly what you want to determine? As we've seen several times on this site, people cannot successfully build a routine that divides two numbers without a specification. You're talking about analysis that is far more complex than dividing two numbers; the code which does what you're describing in the actual compiler is tens of thousands of lines long.

Eric Lippert
I don't need an absolute solution. What I am trying to provide is a simple indicator that there may be an error in the code. 1. If the code contains preprocessor statements, do nothing. This is a corner case for my users, and an order of magnitude more difficult to solve. 2. If the code contains an unmatched { or /*, highlight it. That's not much detail, but it states exactly what I need.
jaws
@user258651: OK, what about strings? Suppose your code has a mismatched brace but contains *Console.WriteLine("}}}}}");* -- do those count as brace ends or not?
Eric Lippert
@user258651: And what if a comment contains a brace? Do you count that as an end brace or not?
Eric Lippert
@user258651: Also, I note that "if the code contains preproc directives then do nothing" logically requires that *you have to write a detector for preprocessor directives*, otherwise you don't know whether to do nothing or not.
Eric Lippert
You're right, on all counts. This is something I'll have to put a good deal of research time into. The simple solution I was looking for simply doesn't exist.
jaws
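
As noted in the comments above, "do nothing if the code contains preprocessor directives" still requires a detector for them. A rough, line-based heuristic might look like the sketch below (Python for brevity; contains_preprocessor_directive is a hypothetical name). It assumes a directive is a '#' that is the first non-whitespace character on a line, followed by one of the C# directive keywords. It does not understand comments or multi-line verbatim strings, so a line such as "#if FOO" inside a block comment is also treated as a directive; for a "do nothing when unsure" rule that errs on the safe side.

```python
import re

# Assumption: a directive must be the first non-whitespace token on its line.
_DIRECTIVE = re.compile(
    r"^[ \t]*#[ \t]*(if|elif|else|endif|define|undef|region|endregion|"
    r"pragma|line|error|warning|nullable)\b",
    re.MULTILINE,
)


def contains_preprocessor_directive(code):
    """Return True if any line of `code` looks like a C# preprocessor directive."""
    return _DIRECTIVE.search(code) is not None


sample = "#if FOO\n    for(foo;bar;blah) {\n#else\n    while(abc) {\n#endif\n"
print(contains_preprocessor_directive(sample))
print(contains_preprocessor_directive('var s = "#if";'))
```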