views:

844

answers:

2

Lots of editors and IDEs have code completion. Some of them are very "intelligent" others are not really. I am interested in the more intelligent type. For example I have seen IDEs that only offer a function if it is a) available in the current scope b) it's return value is valid. (For example after "5 + foo[tab]" it only offers functions that return something that can be added to an integer or variable names of the correct type.) I have also seen that they place the more often used or longest option ahead of the list.

I realize you need to parse the code. But usually while editing the current code is invalid there are syntax errors in it. How do you parse something when it is incomplete and contains errors?

There is also a time constraint. The completion is useless if it takes seconds to come up with a list. Sometimes the completion algorithm deals with thousands of classes.

What are the good algorithms and data structures for this?

+7  A: 

I can't say exactly what algorithms are used by any particular implementation, but I can make some educated guesses. A trie is a very useful data structure for this problem: the IDE can maintain a large trie in memory of all of the symbols in your project, with some extra metadata at each node.

When you type a character, it walks down a path in the trie. All of the descendants of a particular trie node are possible completions. The IDE then just needs to filter those out by the ones that make sense in the current context, but it only needs to compute as many as can be displayed in the tab-completion pop-up window.

More advanced tab-completion requires a more complicated trie. For example, Visual Assist X has a feature whereby you only need to type the capital letters of CamelCase symbols -- e.g., if you type SFN, it shows you the symbol SomeFunctionName in its tab-completion window.

Computing the trie (or other data structures) does require parsing all of your code to get a list of all of the symbols in your project. Visual Studio stores this in its IntelliSense database, an .ncb file stored alongside your project, so that it doesn't have to reparse everything every time you close and reopen your project. The first time you open a large project (say, one you just synced form source control), VS will take the time to parse everything and generate the database.

I don't know how it handles incremental changes. As you said, when you're writing code, it's invalid syntax 90% of the time, and reparsing everything whenever you idled would put a huge tax on your CPU for very little benefit, especially if you're modifying a header file included by a large number of source files.

I suspect that it either (a) only reparses whenever you actually build your project (or possibly when you close/open it), or (b) it does some sort of local parsing where it only parses the code around where you've just edited in some limited fashion, just to get the names of the relevant symbols. Since C++ has such an outstandingly complicated grammar, it may behave oddly in the dark corners if you're using heavy template metaprogramming and the like.

Adam Rosenfield
The trie is a really good idea. As for the incremental changes it may be possible to first try to re-parse the file when that does not work ignore the current line and when that does not work ignore the enclosing { ... } block. If all else fails use the last database.
stribika
+9  A: 

The IntelliSense engine in my UnrealScript language service product is complicated, but I'll give as best an overview here as I can. The C# language service in VS2008 SP1 is my performance goal (for good reason). It's not there yet, but it's fast/accurate enough that I can safely offer suggestions after a single character is typed, without waiting for ctrl+space or the user typing a . (dot). The more information people [working on language services] get about this subject, the better end-user experience I get should I ever use their products. There are a number of products I've had the unfortunate experience of working with that didn't pay such close attention to details, and as a result I was fighting with the IDE more than I was coding.

In my language service, it's laid out like the following:

  1. Get the expression at the cursor. This walks from the beginning of the member access expression to the end of the identifier the cursor is over. The member access expression is generally in the form aa.bb.cc, but can also contain method calls as in aa.bb(3+2).cc.
  2. Get the context surrounding the cursor. This is very tricky, because it doesn't always follow the same rules as the compiler (long story), but for here assume it does. Generally this means get the cached information about the method/class the cursor is within.
  3. Say the context object implements IDeclarationProvider, where you can call GetDeclarations() to get an IEnumerable<IDeclaration> of all items visible in the scope. In my case, this list contains the locals/parameters (if in a method), members (fields and methods, static only unless in an instance method, and no private members of base types), globals (types and constants for the language I'm working on), and keywords. In this list will be an item with the name aa. As a first step in evaluating the expression in #1, we select the item from the context enumeration with the name aa, giving us an IDeclaration for the next step.
  4. Next, I apply the operator to the IDeclaration representing aa to get another IEnumerable<IDeclaration> containing the "members" (in some sense) of aa. Since the . operator is different from the -> operator, I call declaration.GetMembers(".") and expect the IDeclaration object to correctly apply the listed operator.
  5. This continues until I hit cc, where the declaration list may or may not contain an object with the name cc. As I'm sure you're aware, if multiple items begin with cc, they should appear as well. I solve this by taking the final enumeration and passing it through my documented algorithm to provide the user with the most helpful information possible.

Here are some additional notes for the IntelliSense backend:

  • I make extensive use of LINQ's lazy evaluation mechanisms in implementing GetMembers. Each object in my cache is able to provide a functor that evaluates to its members, so performing complicated actions with the tree is near trivial.
  • Instead of each object keeping a List<IDeclaration> of its members, I keep a List<Name>, where Name is a struct containing the hash of a specially-formatted string describing the member. There's an enormous cache that maps names to objects. This way, when I re-parse a file, I can remove all items declared in the file from the cache and repopulate it with the updated members. Due to the way the functors are configured, all expressions immediately evaluate to the new items.

IntelliSense "frontend"

As the user types, the file is syntactically incorrect more often than it is correct. As such, I don't want to haphazardly remove sections of the cache when the user types. I have a large number of special-case rules in place to handle incremental updates as quickly as possible. The incremental cache is only kept local to an open file and helps make ensure the user doesn't realize that their typing is causing the backend cache to hold incorrect line/column information for things like each method in the file.

  • One redeeming factor is my parser is fast. It can handle a full cache update of a 20000 line source file in 150ms while operating self-contained on a low priority background thread. Whenever this parser completes a pass on an open file successfully (syntactically), the current state of the file is moved into the global cache.
  • If the file is not syntactically correct, I use an ANTLR filter parser (sorry about the link - most info is on the mailing list or gathered from reading the source) to reparse the file looking for:
    • Variable/field declarations.
    • The signature for class/struct definitions.
    • The signature for method definitions.
  • In the local cache, class/struct/method definitions begin at the signature and end when the brace nesting level goes back to even. Methods can also end if another method declaration is reached (no nesting methods).
  • In the local cache, variables/fields are linked to the immediately preceding unclosed element. See the brief code snippet below for an example of why this is important.
  • Also, as the user types, I keep a remap table marking the added/removed character ranges. This is used for:
    • Making sure I can identify the correct context of the cursor, since a method can/does move in the file between full parses.
    • Making sure Go To Declaration/Definition/Reference locates items correctly in open files.

Code snippet for the previous section:

class A
{
    int x; // linked to A

    void foo() // linked to A
    {
        int local; // linked to foo()

    // foo() ends here because bar() is starting
    void bar() // linked to A
    {
        int local2; // linked to bar()
    }

    int y; // linked again to A

I figured I'd add a list of the IntelliSense features I've implemented with this layout. Pictures of each are located here.

  • Auto-complete
  • Tool tips
  • Method Tips
  • Class View
  • Code Definition Window
  • Call Browser (VS 2010 finally adds this to C#)
  • Semantically correct Find All References
280Z28
This is great thank you. I never thought of the case sensitive bias when sorting. I especially like that you can deal with mismatched braces.
stribika