tags:

views:

196

answers:

8

Hi all,

I have an interesting problem. I need to analyse a source code and to determine the types of variables before it is compiled. So, Reflection cannot be used!

There are only five types:

double      x = 1.23;
long        x = 3;
string     s='Hello World!' 
bool        b=true 
object[]     A = [1, 1+2, 'Hello', s]

An example of cource code:

for (i=0; i < 5; i++)
{
    a=2;
    b=4;
    c=6;
    tesstClass.Str = 'sss';
}

I decided to use regular expressions to solve the problem.

First, I find all pieces of code with the desirable variable (expressions with it) as follows:

string pattern = variable + @"[\w.]*\s*[-*+/]?=\s*[\w\s+'*/-]*\s*;";
MatchCollection mc = Regex.Matches(code, pattern);

Second, I analyse each Match using 5 regular expressions (one for each type):

string stringPattern = @"'[^'\r\n]*'"; //String;
string doublePattern = @"\b[0-9]+\.[0-9]+\b"; //Double
string longPattern = @"[-+]?\b\d+\b"; // Integer with a sign
string boolPattern = @"\b(false|true)\b"; // Boolean
string arrayPattern = @"\[([\w']*\s*,?\s*)*\]"; // Array

I am very bad in regular expressions. So I've defined a set of very simple r. expressions. Can you help me to refine them.

+3  A: 

What you are attempting to do is very difficult, if not impossible with regular expressions, especially as you have support for string constructs. What happens if I do this:

a = 'b = 3;';

I.e. in this case you would need to escape the string for your regular expression to work.

You really need to perform proper parsing of your code before you are going to be able to perform any meaningful analysis.

Kragen
According to my regular expressions, this variable is a string and an integer. =) That is why I am asking for help.
This is why using regular expressions is a bad idea - its is going to be extremely difficult to accurately escape strings. Although it is probably technically possible to do, the regular expression will end up looking like a mess!
Kragen
The possible solution can be to find all strings ("blabla" or "bool a;") and to replace them with some standard text (like "text").
A: 

Don't use regular expressions. What you are doing is type deduction, and I'm guessing you're doing it for school. They will want you to learn another way, such as logic unification.

You're relying on all the types always being obviously different. What if 0 or 1 is assigned to a Boolean? Regular expressions aren't good at reducing input. Your program will, at best, result in a list of identifiers of each type. There are better approaches.

If you are in a commercial environment, your solution will be totally non-scalable, unmaintainable, unreliable, and slow to implement, as by your own admission, this isn't your strong suit.

If you aren't just doing homework, you should have access to a parser for this language. If you don't, you should start one with a parser generator such as Bison (or whatever is in style).

If you are doing homework, I strongly suggest hitting the book.


Edit: I forgot to say what to do with Bison :vP . You have a data structure for each variable. It should contain a set of possible types. Say, an unsigned int with one bit representing each type, enum type_bits { double_bit = 1, long_bit = 2, string_bit = 4, … };. Begin by setting all the bits to 1; i.e. type_map = (type_bits) -1;. As you encounter each operation, mask out the bits that are incompatible with it. When you're done, you will have some number of bits set. Apply precedence rules if there is more than one, generate an error if there are none.

Potatoswatter
+1  A: 

What to find exactly?

Do you have to find only the literals (constants) or the whole declaration? It's ok to use expressions to find literals but its a little more complicated to parse the entire code

Give a chance to grammars

If you have to parse all the code... Dou you know grammar analyzers? When I studied 'language theory' we used grammars for parsing code. You can define a basic analizer with regex for the tokens (constants, reserved words, symbols, etc) and use a grammar analyzer for all the structure.

A Java option is JavaCC. There must be a .Net option.

Basically a grammar analyzer can parse complex structures (and have 'memory'). If a finite-state-automat is equivalent to a regex, a FSA with stack (it is memory) is equivalent to a grammar. It has more processing power.

helios
+2  A: 

I also doubt that regular expressions are suitable for this.

As Kragen has demonstrated, there are cases where regular expressions will match some piece of source code, but they will ignore the context in which that bit of source code appears. This can lead to errors. While it might be possible to write smarter regular expressions for such cases as Kragen showed, they will quickly become extremely complex and hard to read/maintain/understand, because they have to consider many different possible contexts.

I'd prefer writing a parser using a parser generator (such as Yacc or Bison). But depending on the language of your source code, that can also be quite tricky.

stakx
...but no so tricky. Parsers I think are the right tool for understanding the structure of an input code and transforming it into an object tree representation. Next you can do whatever you want. Of course, the grammar must be carefully edited to avoid ambiguity. But it's ok.
helios
Yes, avoiding ambiguity was what I meant with "tricky". (Ever read about people trying to implement a parser for C++? :) But apart from that, a parser generator would certainly be a suitable tool to use in this case.
stakx
A: 

Have you tried looking at the source code for the mono projects C# compiler, that may have some ideas that you find useful.

svn co svn://anonsvn.mono-project.com/source/trunk/mcs

ilivewithian
+1  A: 

The normal way of doing this would be to get the AST of your program and then simply search for the variable declarations you need. Gramars as suggested are a nice way of generating such AST.

But, if you need to analyse your program on the fly you can't use this option because your code might have parse errors. In this case I feel your pain...

Your only option is to parse your source code and regular expressions might help a bit.

First, I would begin with a regex similar to this:

(double|long|string|bool|object)\s*(\[\s*\])?\s+(YOUR_VARIABLE_TOKEN)

obs: YOUR_VARIABLE_TOKEN is missing because the variable has strong and defined rules about how it can be constructed for each language.

I didn't test this regex and it certainly isn't perfect. It was just to give you an idea.

Second, you would have to validate these matches with certain exception cases. For instance:

  1. The declaration might be inside a String literal : "bool a;"
  2. The declaration might be inside a comment : /* bool a; */

Also, this is not a very strange request. Eclipse does this kind of evaluation too in some cases like indenting.

This is not an easy task though, specially, finding those exception cases. Good Luck.

bruno conde
Hi, I wanted to use AST first, but as you told, the code is not completed and it can have some parse errors. The possible solution can be to find all strings ("blabla" or "bool a;") and to replace them with some standard text (like "text"). The same for comments, they can be deleted at all.
+1  A: 

Given that this is .NET, you could consider using CodeDOM to get it parsed properly.

Use the pre-existing C# CodeDOM provider to get a structured representation of your source code by using the Parse method, then traverse it. This allows you to make a solution that can work for pretty much ANY .NET language.

Even though you said it had to be done before compilation, you might be able to use a temporary in-memory compilation, which you can then work with using reflection. The CodeDOM provider can also help you there.

Michael Madsen
+1  A: 

The only way is to use AST. Regex is a bad sollution.

The Man