How does code completion work?

stribika · Aug 3, 2009 · Viewed 15.4k times

Lots of editors and IDEs have code completion. Some of them are very "intelligent"; others are not. I am interested in the more intelligent type. For example, I have seen IDEs that only offer a function if a) it is available in the current scope and b) its return value is valid. (For example, after "5 + foo[tab]" it only offers functions that return something that can be added to an integer, or variable names of the correct type.) I have also seen that they place the more frequently used or longest options at the top of the list.

I realize you need to parse the code. But while editing, the current code is usually invalid: it contains syntax errors. How do you parse something when it is incomplete and contains errors?

There is also a time constraint. The completion is useless if it takes seconds to come up with a list. Sometimes the completion algorithm deals with thousands of classes.

What are the good algorithms and data structures for this?

Answer

Sam Harwell · Aug 3, 2009

The IntelliSense engine in my UnrealScript language service product is complicated, but I'll give the best overview I can here. The C# language service in VS2008 SP1 is my performance goal (for good reason). It's not there yet, but it's fast and accurate enough that I can safely offer suggestions after a single character is typed, without waiting for Ctrl+Space or for the user to type a . (dot). The more information people [working on language services] get about this subject, the better the end-user experience I get should I ever use their products. There are a number of products I've had the unfortunate experience of working with that didn't pay such close attention to detail, and as a result I was fighting with the IDE more than I was coding.

In my language service, it's laid out like the following:

  1. Get the expression at the cursor. This walks from the beginning of the member access expression to the end of the identifier the cursor is over. The member access expression is generally in the form aa.bb.cc, but can also contain method calls as in aa.bb(3+2).cc.
  2. Get the context surrounding the cursor. This is very tricky, because it doesn't always follow the same rules as the compiler (long story), but for now assume it does. Generally this means getting the cached information about the method/class the cursor is within.
  3. Say the context object implements IDeclarationProvider, where you can call GetDeclarations() to get an IEnumerable<IDeclaration> of all items visible in the scope. In my case, this list contains the locals/parameters (if in a method), members (fields and methods, static only unless in an instance method, and no private members of base types), globals (types and constants for the language I'm working on), and keywords. In this list will be an item with the name aa. As a first step in evaluating the expression in #1, we select the item from the context enumeration with the name aa, giving us an IDeclaration for the next step.
  4. Next, I apply the operator to the IDeclaration representing aa to get another IEnumerable<IDeclaration> containing the "members" (in some sense) of aa. Since the . operator is different from the -> operator, I call declaration.GetMembers(".") and expect the IDeclaration object to correctly apply the listed operator.
  5. This continues until I hit cc, where the declaration list may or may not contain an object with the name cc. As I'm sure you're aware, if multiple items begin with cc, they should appear as well. I solve this by taking the final enumeration and passing it through my documented algorithm to provide the user with the most helpful information possible.
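
The five steps above can be sketched roughly as follows. This is a minimal illustration in Python rather than the C# that the interface names suggest, and it is not the answer's actual implementation: the `Declaration` class and the `complete` function are hypothetical stand-ins for `IDeclaration` and the lookup loop, only the `.` operator is handled, method calls inside the expression are ignored, and the final ranking step is replaced by a plain prefix filter.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Declaration:
    """Hypothetical stand-in for the answer's IDeclaration."""
    name: str
    # Members keyed by operator, e.g. "." or "->" (assumed layout).
    members: Dict[str, List["Declaration"]] = field(default_factory=dict)

    def get_members(self, operator: str) -> Iterable["Declaration"]:
        return self.members.get(operator, [])


def complete(scope: Iterable[Declaration], expression: str) -> List[str]:
    """Resolve every segment but the last, then prefix-match the last one."""
    *path, prefix = expression.split(".")
    candidates = scope
    for segment in path:
        # Steps 3 and 4: pick the declaration by name, then apply the operator.
        match = next((d for d in candidates if d.name == segment), None)
        if match is None:
            return []  # unknown name: nothing to offer
        candidates = match.get_members(".")
    # Step 5: everything that begins with the typed text is a candidate.
    return sorted(d.name for d in candidates if d.name.startswith(prefix))
```

With a scope in which `aa` has a member `bb`, and `bb` has members `cc`, `ccount` and `dd`, calling `complete(scope, "aa.bb.cc")` would return the two names that start with `cc`.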

Here are some additional notes for the IntelliSense backend:

  • I make extensive use of LINQ's lazy evaluation mechanisms in implementing GetMembers. Each object in my cache is able to provide a functor that evaluates to its members, so performing complicated actions with the tree is near trivial.
  • Instead of each object keeping a List<IDeclaration> of its members, I keep a List<Name>, where Name is a struct containing the hash of a specially-formatted string describing the member. There's an enormous cache that maps names to objects. This way, when I re-parse a file, I can remove all items declared in the file from the cache and repopulate it with the updated members. Due to the way the functors are configured, all expressions immediately evaluate to the new items.
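
To illustrate why storing names instead of object references makes re-parsing cheap, here is a small Python sketch. It is an assumption-laden simplification, not the answer's code: `DeclarationCache`, `replace_file` and `members_of` are hypothetical names, plain strings stand in for the hashed `Name` struct, and declarations are plain dictionaries.

```python
from typing import Callable, Dict, Iterator, Set


class DeclarationCache:
    """Hypothetical global cache: name key -> declaration data, tracked per file."""

    def __init__(self) -> None:
        self._by_key: Dict[str, dict] = {}
        self._keys_by_file: Dict[str, Set[str]] = {}

    def replace_file(self, path: str, declarations: Dict[str, dict]) -> None:
        # Drop everything the file declared before, then add the fresh parse.
        for key in self._keys_by_file.pop(path, set()):
            self._by_key.pop(key, None)
        self._by_key.update(declarations)
        self._keys_by_file[path] = set(declarations)

    def members_of(self, key: str) -> Callable[[], Iterator[dict]]:
        # Return a functor; member keys are looked up only when it is called,
        # so it always sees the cache contents at call time.
        def resolve() -> Iterator[dict]:
            owner = self._by_key.get(key, {})
            for member_key in owner.get("members", []):
                if member_key in self._by_key:
                    yield self._by_key[member_key]
        return resolve
```

Because the functor resolves keys lazily, a functor obtained before a re-parse yields the updated members afterwards without being rebuilt.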

IntelliSense "frontend"

As the user types, the file is syntactically incorrect more often than it is correct. As such, I don't want to haphazardly remove sections of the cache when the user types. I have a large number of special-case rules in place to handle incremental updates as quickly as possible. The incremental cache is kept local to an open file and helps ensure the user doesn't notice that their typing is causing the backend cache to hold incorrect line/column information for things like each method in the file.

  • One redeeming factor is that my parser is fast. It can handle a full cache update of a 20,000-line source file in 150 ms while operating self-contained on a low-priority background thread. Whenever this parser completes a pass on an open file successfully (syntactically), the current state of the file is moved into the global cache.
  • If the file is not syntactically correct, I use an ANTLR filter parser (sorry about the link - most info is on the mailing list or gathered from reading the source) to reparse the file looking for:
    • Variable/field declarations.
    • The signature for class/struct definitions.
    • The signature for method definitions.
  • In the local cache, class/struct/method definitions begin at the signature and end when the brace nesting level returns to the level it had at the signature. Methods can also end if another method declaration is reached (no nesting methods).
  • In the local cache, variables/fields are linked to the immediately preceding unclosed element. See the brief code snippet below for an example of why this is important.
  • Also, as the user types, I keep a remap table marking the added/removed character ranges. This is used for:
    • Making sure I can identify the correct context of the cursor, since a method can/does move in the file between full parses.
    • Making sure Go To Declaration/Definition/Reference locates items correctly in open files.
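
The remap table mentioned in the last bullet could look something like the Python sketch below. The answer does not describe its data structure, so everything here is an assumption: `RemapTable` is a hypothetical name, positions are flat character offsets rather than line/column pairs, and each edit is logged in the coordinates of the buffer at the time it happened.

```python
from typing import List, Tuple


class RemapTable:
    """Hypothetical edit log: (offset, delta) pairs since the last full parse."""

    def __init__(self) -> None:
        self._edits: List[Tuple[int, int]] = []

    def record_insert(self, offset: int, length: int) -> None:
        self._edits.append((offset, length))

    def record_delete(self, offset: int, length: int) -> None:
        self._edits.append((offset, -length))

    def to_current(self, parsed_offset: int) -> int:
        """Map an offset from the last full parse to the current buffer."""
        pos = parsed_offset
        for offset, delta in self._edits:
            if delta >= 0:
                if pos >= offset:
                    pos += delta           # text was inserted before this point
            elif pos >= offset - delta:
                pos += delta               # point lies past the deleted range
            elif pos > offset:
                pos = offset               # point was inside the deleted range
        return pos
```

A cached method start can then be translated on demand, for example when deciding which method the cursor is in or where Go To Definition should land, without re-parsing the file after every keystroke.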

Code snippet for the previous section:

class A
{
    int x; // linked to A

    void foo() // linked to A
    {
        int local; // linked to foo()

    // foo() ends here because bar() is starting
    void bar() // linked to A
    {
        int local2; // linked to bar()
    }

    int y; // linked again to A
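
The linking rules illustrated by the snippet above can be approximated with a short Python sketch. This is not the ANTLR filter parser the answer uses; it is a line-based scanner with deliberately naive regular expressions, and the names `link_declarations`, `CLASS_RE`, `METHOD_RE` and `VAR_RE` are hypothetical. It assumes one declaration per line and `//` line comments only.

```python
import re
from typing import List, Tuple

CLASS_RE = re.compile(r"^\s*(?:class|struct)\s+(\w+)")
METHOD_RE = re.compile(r"^\s*\w+\s+(\w+)\s*\(")
VAR_RE = re.compile(r"^\s*\w+\s+(\w+)\s*;")


def link_declarations(source: str) -> List[Tuple[str, str]]:
    """Return (declaration, owner) pairs using only signatures and brace depth."""
    links: List[Tuple[str, str]] = []
    open_elems: List[Tuple[str, str, int]] = []  # (kind, name, depth at signature)
    depth = 0
    for line in source.splitlines():
        code = line.split("//")[0]  # ignore line comments
        cls = CLASS_RE.match(code)
        meth = METHOD_RE.match(code)
        var = VAR_RE.match(code)
        if meth:
            # No nested methods: a new signature closes any open method.
            while open_elems and open_elems[-1][0] == "method":
                depth = open_elems.pop()[2]
        owner = open_elems[-1][1] if open_elems else "<file>"
        if cls or meth:
            name = (cls or meth).group(1)
            links.append((name, owner))
            open_elems.append(("class" if cls else "method", name, depth))
        elif var:
            links.append((var.group(1), owner))
        depth += code.count("{")
        for _ in range(code.count("}")):
            depth -= 1
            # An element ends when nesting returns to its signature depth.
            if open_elems and open_elems[-1][2] == depth:
                open_elems.pop()
    return links
```

Run over the snippet above, this links `x`, `foo`, `bar` and `y` to `A`, `local` to `foo`, and `local2` to `bar`, even though the closing braces of `foo()` and `A` are missing.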

I figured I'd add a list of the IntelliSense features I've implemented with this layout. Pictures of each are located here.

  • Auto-complete
  • Tool tips
  • Method Tips
  • Class View
  • Code Definition Window
  • Call Browser (VS 2010 finally adds this to C#)
  • Semantically correct Find All References