I'm working on a completion (intellisense) facility for C# in emacs.

The idea is, if a user types a fragment, then asks for completion via a particular keystroke combination, the completion facility will use .NET reflection to determine the possible completions.

Doing this requires that the type of the thing being completed be known. If it's a string, there's a known set of possible methods and properties; if it's an Int32, there's a separate set, and so on.

Using semantic, a code lexer/parser package available in emacs, I can locate the variable declarations and their types. Given that, it's straightforward to use reflection to get the methods and properties on the type, and then present the list of options to the user. (OK, not quite straightforward to do within emacs, but using the ability to run a PowerShell process inside emacs, it becomes much easier. I write a custom .NET assembly to do reflection, load it into the PowerShell session, and then elisp running within emacs can send commands to PowerShell and read responses via comint. As a result, emacs can get the results of reflection quickly.)
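
For illustration, the reflection helper could be as simple as this (a hypothetical sketch with made-up names, not my actual assembly):

using System;
using System.Reflection;

// Hypothetical helper: given a type name, print the type's public instance
// members one per line, for elisp to read back through comint.
public static class CompletionHelper
{
    public static void PrintMembers(string typeName)
    {
        Type t = Type.GetType(typeName, true); // e.g. "System.String"
        foreach (MemberInfo m in t.GetMembers(BindingFlags.Public | BindingFlags.Instance))
            Console.WriteLine("{0} {1}", m.MemberType, m.Name);
    }
}

From PowerShell, after loading the assembly with [Reflection.Assembly]::LoadFile(...), elisp can have the shell evaluate [CompletionHelper]::PrintMembers("System.String") and read the lines back.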

The problem arises when the code uses var in the declaration of the thing being completed. That means the type is not explicitly specified, and completion won't work.

How can I reliably determine the actual type used, when the variable is declared with the var keyword? Just to be clear, I don't need to determine it at runtime. I want to determine it at "Design time".

So far I have these ideas:

  1. compile and invoke:
    • extract the declaration statement, eg `var foo = "a string value";`
    • concatenate a statement `foo.GetType();`
    • dynamically compile the resulting C# fragment into a new assembly
    • load the assembly into a new AppDomain, run the fragment, and get the return type.
    • unload and discard the assembly

    I know how to do all this, but it sounds awfully heavyweight for each completion request in the editor.

    I suppose I don't need a fresh new AppDomain every time. I could re-use a single AppDomain for multiple temporary assemblies, and amortize the cost of setting it up and tearing it down, across multiple completion requests. That's more a tweak of the basic idea.

  2. compile and inspect IL

    Simply compile the declaration into a module, then inspect the IL to determine the actual type the compiler inferred. How would this be possible? What would I use to examine the IL?

Any better ideas out there? Comments? Suggestions?


EDIT - Thinking about this further, compile-and-invoke is not acceptable because the invocation may have side effects. So the first option must be ruled out.

Also, I think I cannot assume the presence of .NET 4.0.


UPDATE - The correct answer, unmentioned above but gently pointed out by Eric Lippert, is to implement a full-fidelity type inference system. It's the only way to reliably determine the type of a var at design time. But it's also not easy to do. Because I harbor no illusions that I want to attempt building such a thing, I took the shortcut of option 2 - extract the relevant declaration code and compile it, then inspect the resulting IL.

This actually works for a fair subset of the completion scenarios.

For example, suppose in the following code fragments, the ? is the position at which the user asks for completion. This works:

var x = "hello there"; 
x.?

The completion realizes that x is a String, and provides the appropriate options. It does this by generating and then compiling the following source code:

namespace N1 {
  static class dmriiann5he { // randomly-generated class name
    static void M1 () {
       var x = "hello there"; 
    }
  }
}

...and then inspecting the IL with simple reflection.
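
In outline, that compile-and-inspect step looks like the following (a minimal sketch rather than my exact code; it sticks to pre-.NET 4.0 APIs - CSharpCodeProvider to compile in memory, then MethodBody.LocalVariables to read the inferred types, so no actual IL disassembly is needed):

using System;
using System.CodeDom.Compiler;
using System.Reflection;
using Microsoft.CSharp;

class VarTypeInspector
{
    static void Main()
    {
        // the generated source shown above
        string source = @"
namespace N1 {
  static class dmriiann5he {
    static void M1 () {
       var x = ""hello there"";
    }
  }
}";
        var provider = new CSharpCodeProvider();
        var options = new CompilerParameters { GenerateInMemory = true };
        CompilerResults results = provider.CompileAssemblyFromSource(options, source);
        if (results.Errors.HasErrors)
            throw new InvalidOperationException("fragment did not compile");

        MethodInfo method = results.CompiledAssembly
            .GetType("N1.dmriiann5he", true)
            .GetMethod("M1", BindingFlags.Static | BindingFlags.NonPublic);

        // the compiler records the inferred type of each local in the
        // method body's metadata
        foreach (LocalVariableInfo local in method.GetMethodBody().LocalVariables)
            Console.WriteLine("local {0}: {1}", local.LocalIndex, local.LocalType);
        // prints: local 0: System.String
    }
}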

This also works:

var x = new XmlDocument();
x.? 

The engine adds the appropriate using clauses to the generated source code, so that it compiles properly, and then the IL inspection is the same.

This works, too:

var x = "hello"; 
var y = x.ToCharArray();    
var z = y.?

It just means the IL inspection has to find the type of the third local variable, instead of the first.

And this:

var foo = "Tra la la";
var fred = new System.Collections.Generic.List<String>
    {
        foo,
        foo.Length.ToString()
    };
var z = fred.Count;
var x = z.?

...which is just one level deeper than the prior example.

But what doesn't work is completion on any local variable whose initialization depends at any point on an instance member or a method argument. Like:

var foo = this.InstanceMethod();
foo.?

Nor LINQ syntax.

I'll have to think about how valuable those things are before I consider addressing them via what is definitely a "limited design" (polite word for hack) for completion.

An approach to addressing the issue with dependencies on method arguments or instance members would be to replace those references, in the fragment of code that gets generated, compiled, and IL-analyzed, with "synthetic" local vars of the same type.


Another Update - completion on vars that depend on instance members now works.

What I did was interrogate the type (via semantic), and then generate synthetic stand-in members for all existing members. For a C# buffer like this:

public class CsharpCompletion
{
    private static int PrivateStaticField1 = 17;

    string InstanceMethod1(int index)
    {
        ...lots of code here...
        return result;
    }

    public void Run(int count)
    {
        var foo = "this is a string";
        var fred = new System.Collections.Generic.List<String>
        {
            foo,
            foo.Length.ToString()
        };
        var z = fred.Count;
        var mmm = count + z + CsharpCompletion.PrivateStaticField1;
        var nnn = this.InstanceMethod1(mmm);
        var fff = nnn.?

        ...more code here...

...the generated code that gets compiled, so that I can learn the type of the local var nnn from the output IL, looks like this:

namespace Nsbwhi0rdami {
  class CsharpCompletion {
    private static int PrivateStaticField1 = default(int);
    string InstanceMethod1(int index) { return default(string); }

    void M0zpstti30f4 (int count) {
      var foo = "this is a string";
      var fred = new System.Collections.Generic.List<String> { foo, foo.Length.ToString() };
      var z = fred.Count;
      var mmm = count + z + CsharpCompletion.PrivateStaticField1;
      var nnn = this.InstanceMethod1(mmm);
    }
  }
}

All of the instance and static type members are available in the skeleton code. It compiles successfully. At that point, determining the type of the local var is straightforward via Reflection.

What makes this possible is:

  • the ability to run PowerShell in emacs
  • the C# compiler is really fast. On my machine, it takes about 0.5s to compile an in-memory assembly. Not fast enough for between-keystrokes analysis, but fast enough to support the on-demand generation of completion lists.

I haven't looked into LINQ yet. That will be a much bigger problem, because the semantic lexer/parser emacs has for C# doesn't "do" LINQ.

A: 

For solution "1", .NET 4 has a new facility to do this quickly and easily, so if you can convert your program to .NET 4, that would be your best choice.

Softion
+3  A: 

Intellisense systems typically represent the code using an Abstract Syntax Tree, which allows them to resolve the return type of the function being assigned to the 'var' variable in more or less the same way the compiler does. If you use VS Intellisense, you may notice that it won't give you the type of var until you've finished entering a valid (resolvable) assignment expression. If the expression is still ambiguous (for instance, it can't fully infer the generic arguments for the expression), the var type will not resolve. This can be a fairly complex process, as you might need to walk fairly deep into a tree in order to resolve the type. For instance:

var items = myList.OfType<Foo>().Select(foo => foo.Bar);

The return type is IEnumerable<Bar>, but resolving this requires knowing (see the sketch after this list):

  1. myList is of a type that implements IEnumerable.
  2. There is an extension method OfType<T> that applies to IEnumerable.
  3. The resulting value is IEnumerable<Foo> and there is an extension method Select that applies to this.
  4. The lambda expression foo => foo.Bar has the parameter foo of type Foo. This is inferred by the usage of Select, which takes a Func<TIn,TOut> and since TIn is known (Foo), the type of foo can be inferred.
  5. The type Foo has a property Bar, which is of type Bar. We know that Select returns IEnumerable<TOut> and TOut can be inferred from the result of the lambda expression, so the resulting type of items must be IEnumerable<Bar>.
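
To make the chain concrete, here is a small runnable sketch with the inferred intermediate types written out explicitly (assuming, per step 5, a Foo class with a Bar property of type Bar):

using System;
using System.Collections.Generic;
using System.Linq;

class Foo { public Bar Bar { get; set; } }
class Bar { }

class Demo
{
    static void Main()
    {
        var myList = new List<object> { new Foo { Bar = new Bar() } };

        // Steps 1-3: OfType<Foo> applies to IEnumerable, yielding IEnumerable<Foo>.
        IEnumerable<Foo> ofType = myList.OfType<Foo>();

        // Steps 4-5: Select infers TIn = Foo from the source, so foo.Bar
        // resolves to type Bar and the result is IEnumerable<Bar>.
        IEnumerable<Bar> items = ofType.Select(foo => foo.Bar);

        Console.WriteLine(items.First().GetType()); // prints Bar
    }
}
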
Dan Bryant
Right, it can get pretty deep. I'm comfortable with resolving all the dependencies. Just thinking about this, the first option I described - compile and invoke - is absolutely not acceptable, because invoking code can have side effects, like updating a database, and that's not something an editor ought to do. Compiling is ok, invoking is not. As far as building the AST, I don't think I want to do that. Really I want to defer that job to the compiler, which already knows how to do it. I want to be able to ask the compiler to tell me what I want to know. I just want a simple answer.
Cheeso
The challenge with inspecting it from compilation is that dependencies can be arbitrarily deep, which means you may need to build everything in order for the compiler to generate code. If you do that, I think you can use the debugger symbols with the generated IL and match up the type of each local with its symbol.
Dan Bryant
@Cheeso: the compiler does not offer that kind of type analysis as a service. I hope that in the future it will but no promises.
Eric Lippert
@Dan, yes, I think that might be the way to go - resolve all the dependencies and then compile and inspect IL. @Eric, good to know. For now I don't aspire to do the complete AST analysis, so I must resort to a dirty hack to produce this service using existing tools. Eg, compile an intelligently-constructed fragment of code and then use ILDASM (or similar) programmatically to get the answer I seek.
Cheeso
+4  A: 

If you don't want to have to write your own parser to build the abstract syntax tree, you could look at using the parsers from either SharpDevelop or MonoDevelop, both of which are open source.

Daniel Plaisted
+150  A: 

I can describe for you how we do that efficiently in the "real" C# IDE.

The first thing we do is run a pass which analyzes only the "top level" stuff in the source code. We skip all the method bodies. That allows us to quickly build up a database of information about what namespaces, types and methods (and constructors, etc) are in the source code of the program. Analyzing every single line of code in every method body would take way too long if you're trying to do it between keystrokes.

When the IDE needs to work out the type of a particular expression inside a method body -- say you've typed "foo." and we need to figure out what are the members of foo -- we do the same thing; we skip as much work as we reasonably can.

We start with a pass which analyzes only the local variable declarations within that method. When we run that pass we make a mapping from a pair of "scope" and "name" to a "type determiner". The "type determiner" is an object that represents the notion of "I can work out the type of this local if I need to". Working out the type of a local can be expensive, so we want to defer that work until we actually need it.
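
As a toy sketch of that idea (the shape of it only, not the actual IDE code):

using System;
using System.Collections.Generic;

// A "type determiner" defers the possibly-expensive inference of a
// local's type until someone actually asks for it.
class TypeDeterminer
{
    private readonly Func<Type> infer; // e.g. walks the initializer expression
    private Type cached;

    public TypeDeterminer(Func<Type> infer) { this.infer = infer; }

    public Type GetLocalType()
    {
        if (cached == null) cached = infer(); // do the work once, on demand
        return cached;
    }
}

class LocalsDatabase
{
    // maps "scope::name" to the determiner for that local
    private readonly Dictionary<string, TypeDeterminer> locals =
        new Dictionary<string, TypeDeterminer>();

    public void Declare(string scope, string name, Func<Type> infer)
    {
        locals[scope + "::" + name] = new TypeDeterminer(infer);
    }

    public Type TypeOf(string scope, string name)
    {
        return locals[scope + "::" + name].GetLocalType();
    }
}

class Demo
{
    static void Main()
    {
        var db = new LocalsDatabase();
        // var y = x.ToCharArray();  -- inference runs only if y is queried
        db.Declare("M1", "y",
            () => typeof(string).GetMethod("ToCharArray", Type.EmptyTypes).ReturnType);
        Console.WriteLine(db.TypeOf("M1", "y")); // System.Char[]
    }
}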

We now have a lazily-built database that can tell us the type of every local. So, getting back to that "foo." -- we figure out which statement the relevant expression is in and then run the semantic analyzer against just that statement. For example, suppose you have the method body:

String x = "hello";
var y = x.ToCharArray();
var z = from foo in y where foo.

and now we need to work out that foo is of type char. We build a database that has all the metadata, extension methods, source code types, and so on. We build a database that has type determiners for x, y and z. We analyze the statement containing the interesting expression. We start by transforming it syntactically to

var z = y.Where(foo=>foo.

In order to work out the type of foo we must first know the type of y. So at this point we ask the type determiner "what is the type of y"? It then starts up an expression evaluator which parses x.ToCharArray() and asks "what's the type of x"? We have a type determiner for that which says "I need to look up "String" in the current context". There is no type String in the current type, so we look in the namespace. It's not there either so we look in the using directives and discover that there's a "using System" and that System has a type String. OK, so that's the type of x.

We then query System.String's metadata for the type of ToCharArray and it says that it's a System.Char[]. Super. So we have a type for y.

Now we ask "does System.Char[] have a method Where?" No. So we look in the using directives; we have already precomputed a database containing all of the metadata for extension methods that could possibly be used.

Now we say "OK, there are eighteen dozen extension methods named Where in scope, do any of them have a first formal parameter whose type is compatible with System.Char[]?" So we start a round of convertibility testing. However, the Where extension methods are generic, which means we have to do type inference.

I've written a special type inferencing engine that can handle making incomplete inferences from the first argument to an extension method. We run the type inferrer and discover that there is a Where method that takes an IEnumerable<T>, and that we can make an inference from System.Char[] to IEnumerable<System.Char>, so T is System.Char.

The signature of this method is Where<T>(this IEnumerable<T> items, Func<T, bool> predicate), and we know that T is System.Char. Also we know that the first argument inside the parentheses to the extension method is a lambda. So we start up a lambda expression type inferrer that says "the formal parameter foo is assumed to be System.Char" and uses this fact when analyzing the rest of the lambda.

We now have all the information we need to analyze the body of the lambda, which is "foo.". We look up the type of foo, we discover that according to the lambda binder it is System.Char, and we're done; we display type information for System.Char.

And we do everything except the "top level" analysis between keystrokes. Actually writing all the analysis is not the hard part; it's making it fast enough to run at typing speed that is the real tricky bit.

Good luck!

Eric Lippert
Eric, thanks for the complete reply. You have opened my eyes quite a bit. For emacs, I was not aspiring to produce a dynamic, between-keystrokes engine that would compete with Visual Studio in terms of quality of the user experience. For one thing, because of the ~0.5s latency inherent in *my* design, the emacs-based facility is and will remain on-demand only; no type-ahead suggestions. For another - I will implement basic support for var locals, but I'll happily punt when things get hairy, or when the dependency graph exceeds a certain limit. Not sure what that limit is yet. Thanks again.
Cheeso
@Eric, why do you have to do that between keystrokes? Can't you build a real ctags-style database, and update it based on the diff the current keystroke made?
Elazar Leibovich
... anxiously waiting for the code-golf questions to implement all of this in your language of choice... They're bound to appear!
Lasse V. Karlsen
@Elazar - I imagine it's because calculating the diffs and mutating complex structures is more expensive and more error prone than recalculating that information just from a method's body, given that you already have the top-level info. There is so much in common between a real compile and providing symbolic insight that it wouldn't make sense not to share much of the code; and if you can make that code fast enough for code insight, you also reap the benefit in straight-line compilation. Also, ctags is extremely primitive; proper symbolic insight relies on full type info.
Barry Kelly
It honestly boggles my mind that all of this can work so quickly and reliably, particularly with lambda expressions and generic type inference. I was actually quite surprised the first time I wrote a lambda expression and Intellisense knew the type of my parameter when I pressed ., even though the statement was not yet complete and I never explicitly specified the generic parameters of the extension methods. Thanks for this little peek into the magic.
Dan Bryant
@Dan: I've seen (or written) the source code and it boggles my mind that it works too. :-) There's some hairy stuff in there.
Eric Lippert
I still wonder how Eclipse is able to do this so much better and faster for Java, with C# being essentially an equivalent of it.
TomA
The Eclipse guys probably do it better because they are *more awesome* than the C# compiler and IDE team.
Eric Lippert
@Eric - indeed. As TomA points out, despite the fact that Java and C# are virtually identical languages, Eclipse's type-ahead features don't seem to be slowed down at all by analyzing lambdas, extension methods, or implicitly typed locals. Sheer awesomeness seems to be the only plausible explanation.
kvb
@TomA: I'll punt on faster, but better? What's better than 100%?
Ben M
@TomA: Interestingly Eclipse is way slower for all this for me. But maybe I'm just doing something wrong (like trying to write code with it ... maybe).
Joey
I don't remember making this stupid comment at all. It doesn't even make sense. I must have been drunk. Sorry.
TomA
+3  A: 

Since you are targeting Emacs, it may be best to start with the CEDET suite. All the details that Eric Lippert describes are already covered in the code analyzer in the CEDET/Semantic tool for C++. There is also a C# parser (which probably needs a little TLC), so the only parts missing are related to tuning the necessary parts for C#.

The basic behaviors are defined in core algorithms that depend on overloadable functions defined on a per-language basis. The success of the completion engine depends on how much tuning has been done. With C++ as a guide, getting C# support of similar quality shouldn't be too bad.

Daniel's answer suggests using MonoDevelop to do parsing and analysis. This could be an alternative mechanism instead of the existing C# parser, or it could be used to augment the existing parser.

Eric
Right, I know about CEDET, and I am using the C# support in the contrib directory for semantic. Semantic provides the list of local variables and their types. A completion engine can scan that list and offer the right choices to the user. The problem is when the variable is `var`. Semantic correctly identifies it as var, but does not provide type inference. My question was specifically around how to address *that*. I also looked into plugging into the existing CEDET completion, but I couldn't figure out how. The documentation for CEDET is...ah...not complete.
Cheeso
Side comment - CEDET is admirably ambitious, but I found it hard to use and extend. Currently the parser treats "namespace" as a *class* indicator in C#. I couldn't even figure out how to add "namespace" as a distinct syntactic element. Doing so prevented all other syntactic analysis, and I couldn't figure out why. I previously explained the difficulty I had with the completion framework. Beyond these problems, there are seams and overlap between the pieces. As one example, navigation is part of both semantic and senator. CEDET seems enticing, but in the end... it's too unwieldy to commit to.
Cheeso
Cheeso, if you want to get the most out of the less documented parts of CEDET, your best bet is to try the mailing list. It is easy for questions to delve into areas that have not been well developed yet, so it takes a few iterations to work out good solutions, or to explain existing ones. For C# in particular, since I don't know anything about it, there will be no simple one off answers.
Eric
+1  A: 

It's a hard problem to do well. Basically you need to model the language spec/compiler through most of the lexing/parsing/typechecking and build an internal model of the source code which you can then query. Eric describes it in detail for C#. You can always download the F# compiler source code (part of the F# CTP) and take a look at service.fsi to see the interface exposed out of the F# compiler that the F# language service consumes for providing intellisense, tooltips for inferred types, etc. It gives a sense of a possible 'interface' if you already had the compiler available as an API to call into.

The other route is to re-use the compilers as-is as you're describing, and then use reflection or look at the generated code. This is problematic from the point of view that you need 'full programs' to get a compilation output from a compiler, whereas when editing source code in the editor, you often only have 'partial programs' that do not yet parse, don't have all methods implemented yet, etc.

In short, I think the 'low budget' version is very hard to do well, and the 'real' version is very, very hard to do well. (Where 'hard' here measures both 'effort' and 'technical difficulty'.)

Brian
Ya, the "low budget" version has some clear limitations. I'm trying to decide what "good enough" is, and whether I can meet that bar. In my own experience dogfooding what I've got so far, it makes writing C# within emacs much nicer.
Cheeso
+9  A: 

I can tell you roughly how the Delphi IDE works with the Delphi compiler to do intellisense (code insight is what Delphi calls it). It's not 100% applicable to C#, but it's an interesting approach which deserves consideration.

Most semantic analysis in Delphi is done in the parser itself. Expressions are typed as they are parsed, except for situations where this is not easy - in which case look-ahead parsing is used to work out what's intended, and then that decision is used in the parse.

The parse is largely LL(2) recursive descent, except for expressions, which are parsed using operator precedence. One of the distinct things about Delphi is that it's a single-pass language, so constructs need to be declared before being used, which means no top-level pass is needed to bring that information out.

This combination of features means that the parser has roughly all the information needed for code insight at any point where it's needed. The way it works is this: the IDE informs the compiler's lexer of the position of the cursor (the point where code insight is desired) and the lexer turns this into a special token (it's called the kibitz token). Whenever the parser meets this token (which could be anywhere) it knows that this is the signal to send all the information it has back to the editor. It does this using a longjmp because it's written in C; what it does is notify the ultimate caller of the kind of syntactic construct (i.e. grammatical context) the kibitz point was found in, as well as all the symbolic tables necessary for that point. So for example, if the context is in an expression which is an argument to a method, then we can check the method overloads, look at the argument types, and filter the valid symbols to only those which can resolve to that argument type (this cuts down on a lot of irrelevant cruft in the drop-down). If it's in a nested scope context (e.g. after a "."), the parser will have handed back a reference to the scope, and the IDE can enumerate all the symbols found in that scope.

Other things are also done; for example, method bodies are skipped if the kibitz token does not lie in their range - this is done optimistically, and rolled back if it skipped over the token. The equivalent of extension methods - class helpers in Delphi - have a kind of versioned cache, so their lookup is reasonably fast. But Delphi's generic type inference is much weaker than C#'s.

Now, to the specific question: inferring the types of variables declared with var is equivalent to the way Pascal infers the type of constants. It comes from the type of the initialization expression. These types are built from the bottom up. If x is of type Integer, and y is of type Double, then x + y will be of type Double, because those are the rules of the language; etc. You follow these rules until you have a type for the full expression on the right hand side, and that's the type you use for the symbol on the left.
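
A toy sketch of that bottom-up computation, in C# for consistency with the question (illustrative only, with a single promotion rule):

using System;

// Toy illustration of bottom-up expression typing: each node's type is
// computed from its children's types using the language's promotion rules.
abstract class Expr { public abstract Type Infer(); }

class Literal : Expr
{
    private readonly object value;
    public Literal(object value) { this.value = value; }
    public override Type Infer() { return value.GetType(); }
}

class Add : Expr
{
    private readonly Expr left, right;
    public Add(Expr left, Expr right) { this.left = left; this.right = right; }
    public override Type Infer()
    {
        Type l = left.Infer(), r = right.Infer();
        // single promotion rule for the sketch: int + double -> double
        return (l == typeof(double) || r == typeof(double))
            ? typeof(double) : typeof(int);
    }
}

class Demo
{
    static void Main()
    {
        // var sum = x + y;  where x is an int and y is a double
        Expr sum = new Add(new Literal(1), new Literal(2.5));
        Console.WriteLine(sum.Infer()); // System.Double -- the type the symbol receives
    }
}
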

Barry Kelly