How does code completion work?

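Lots of editors and IDEs have code completion. Some of them are very "intelligent", others not really. I am interested in the more intelligent kind. For example, I have seen IDEs that only offer a function if it is a) available in the current scope and b) its return value is valid. (For example, after "5 + foo[tab]" it only offers functions that return something that can be added to an integer, or variable names of the correct type.) I have also seen that they put the more frequently used or the longest option at the top of the list.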

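I realize you need to parse the code. But while you are editing, the current code is usually invalid because it contains syntax errors. How do you parse something that is incomplete and contains errors?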

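There is also a time constraint. Completion is useless if it takes seconds to come up with a list, and sometimes the completion algorithm has to deal with thousands of classes.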

What are good algorithms and data structures for this?

I can't say exactly what algorithms are used by any particular implementation, but I can make some educated guesses. A trie is a very useful data structure for this problem: the IDE can maintain a large trie in memory of all of the symbols in your project, with some extra metadata at each node.

When you type a character, it walks down a path in the trie. All of the descendants of a particular trie node are possible completions. The IDE then just needs to filter those down to the ones that make sense in the current context, and it only needs to compute as many as can be displayed in the tab-completion pop-up window.
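To make the trie idea concrete, here is a minimal Python sketch. The names (`SymbolTrie`, `complete`, the `accept` callback standing in for the context filter) are made up for illustration and are not taken from any real IDE. It walks down the prefix, then lazily enumerates descendants and stops as soon as the pop-up would be full.

```python
from itertools import islice


class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.symbol = None   # full symbol name if one ends at this node


class SymbolTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, symbol):
        node = self.root
        for ch in symbol:
            node = node.children.setdefault(ch, TrieNode())
        node.symbol = symbol

    def _walk(self, node):
        # Depth-first with children in sorted order, so results are alphabetical.
        if node.symbol is not None:
            yield node.symbol
        for ch in sorted(node.children):
            yield from self._walk(node.children[ch])

    def complete(self, prefix, limit=10, accept=lambda symbol: True):
        # Walk down the path spelled by the typed prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Lazily filter descendants and stop once `limit` results are found.
        matches = (s for s in self._walk(node) if accept(s))
        return list(islice(matches, limit))


trie = SymbolTrie()
for name in ["foo", "fooBar", "format", "find", "bar"]:
    trie.insert(name)

print(trie.complete("fo", limit=2))  # ['foo', 'fooBar']
```

Because `_walk` is a generator, the `accept` filter only runs on as many candidates as are needed to fill the list, which is the point made above about not computing more than can be displayed.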

More advanced tab-completion requires a more complicated trie. For example, Visual Assist X has a feature whereby you only need to type the capital letters of CamelCase symbols -- e.g., if you type SFN, it shows you the symbol SomeFunctionName in its tab-completion window.
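I don't know how Visual Assist X implements this, but one simple way to get the same effect is to index each symbol by its capital letters in addition to the regular trie. The sketch below is an assumption-laden illustration in Python; `camel_initials` and `InitialsIndex` are hypothetical names.

```python
from collections import defaultdict


def camel_initials(symbol):
    """Return the uppercase letters of a symbol, e.g. 'SomeFunctionName' -> 'SFN'."""
    return "".join(ch for ch in symbol if ch.isupper())


class InitialsIndex:
    def __init__(self):
        self._by_initials = defaultdict(list)  # 'SFN' -> ['SomeFunctionName', ...]

    def add(self, symbol):
        initials = camel_initials(symbol)
        if initials:  # symbols without capitals are not indexed here
            self._by_initials[initials].append(symbol)

    def lookup(self, typed):
        # A symbol matches when its initials start with the typed capitals.
        return sorted(
            symbol
            for initials, symbols in self._by_initials.items()
            if initials.startswith(typed)
            for symbol in symbols
        )


index = InitialsIndex()
for name in ["SomeFunctionName", "SetFileName", "SomeFunctionNameEx", "parse"]:
    index.add(name)

print(index.lookup("SFN"))
```

The lookup here scans every distinct initials string, which is fine for a demonstration. Storing the initials in a second trie would avoid the scan on large projects.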

Computing the trie (or other data structures) does require parsing all of your code to get a list of all of the symbols in your project. Visual Studio stores this in its IntelliSense database, an .ncb file stored alongside your project, so that it doesn't have to reparse everything every time you close and reopen your project. The first time you open a large project (say, one you just synced from source control), VS will take the time to parse everything and generate the database.
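The general idea of a persisted symbol database can be sketched as follows. This does not model the .ncb format in any way: the database is just a JSON string, `extract_symbols` is a regex placeholder for a real parser (it happens to recognise Python-style `def`/`class` lines), and all names are hypothetical. The point is only that a file whose content digest is unchanged does not need to be parsed again.

```python
import hashlib
import json
import re


def extract_symbols(source):
    # Placeholder for a real parser: names that follow 'def' or 'class'.
    return re.findall(r"^\s*(?:def|class)\s+(\w+)", source, flags=re.MULTILINE)


class SymbolDatabase:
    def __init__(self):
        self.entries = {}     # path -> {"digest": str, "symbols": [str]}
        self.parse_count = 0  # how many times the parser actually ran

    def update(self, path, source):
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        entry = self.entries.get(path)
        if entry is None or entry["digest"] != digest:
            # New or modified file: parse it and remember the digest.
            self.entries[path] = {"digest": digest,
                                  "symbols": extract_symbols(source)}
            self.parse_count += 1
        return self.entries[path]["symbols"]

    def to_json(self):
        return json.dumps(self.entries)

    @classmethod
    def from_json(cls, text):
        db = cls()
        db.entries = json.loads(text)
        return db


source = "class A:\n    def foo(self):\n        pass\n"
db = SymbolDatabase()
db.update("a.py", source)            # first open: the file is parsed
saved = db.to_json()                 # an IDE would write this to disk

reopened = SymbolDatabase.from_json(saved)
print(reopened.update("a.py", source), reopened.parse_count)
```

After reloading, the unchanged file is served from the stored entry and `parse_count` stays at zero. Only an edited file triggers the parser again.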

I don't know how it handles incremental changes. As you said, when you're writing code, it's invalid syntax 90% of the time, and reparsing everything whenever you idle would put a huge tax on your CPU for very little benefit, especially if you're modifying a header file included by a large number of source files.

I suspect that it either (a) only reparses whenever you actually build your project (or possibly when you close/open it), or (b) it does some sort of local parsing where it only parses the code around where you've just edited in some limited fashion, just to get the names of the relevant symbols. Since C++ has such an outstandingly complicated grammar, it may behave oddly in the dark corners if you're using heavy template metaprogramming and the like.

The IntelliSense engine in my UnrealScript language service product is complicated, but I'll give as good an overview here as I can. The C# language service in VS2008 SP1 is my performance goal (for good reason). It's not there yet, but it's fast/accurate enough that I can safely offer suggestions after a single character is typed, without waiting for ctrl+space or the user typing a . (dot). The more information people [working on language services] get about this subject, the better end-user experience I get should I ever use their products. There are a number of products I've had the unfortunate experience of working with that didn't pay such close attention to details, and as a result I was fighting with the IDE more than I was coding.

In my language service, it's laid out like the following:

  1. Get the expression at the cursor. This walks from the beginning of the member access expression to the end of the identifier the cursor is over. The member access expression is generally in the form aa.bb.cc, but can also contain method calls as in aa.bb(3+2).cc.
  2. Get the context surrounding the cursor. This is very tricky, because it doesn't always follow the same rules as the compiler (long story), but for here assume it does. Generally this means get the cached information about the method/class the cursor is within.
  3. Say the context object implements IDeclarationProvider, where you can call GetDeclarations() to get an IEnumerable<IDeclaration> of all items visible in the scope. In my case, this list contains the locals/parameters (if in a method), members (fields and methods, static only unless in an instance method, and no private members of base types), globals (types and constants for the language I'm working on), and keywords. In this list will be an item with the name aa. As a first step in evaluating the expression in #1, we select the item from the context enumeration with the name aa, giving us an IDeclaration for the next step.
  4. Next, I apply the operator to the IDeclaration representing aa to get another IEnumerable<IDeclaration> containing the "members" (in some sense) of aa. Since the . operator is different from the -> operator, I call declaration.GetMembers(".") and expect the IDeclaration object to correctly apply the listed operator.
  5. This continues until I hit cc, where the declaration list may or may not contain an object with the name cc. As I'm sure you're aware, if multiple items begin with cc, they should appear as well. I solve this by taking the final enumeration and passing it through my documented algorithm to provide the user with the most helpful information possible.
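The steps above can be sketched in Python roughly as follows. This is not the author's C# code: `Declaration` and `get_members` are simplified stand-ins for IDeclaration and GetMembers, only the "." operator is modelled, method calls inside the chain (as in aa.bb(3+2).cc) are not handled, and the final ranking algorithm is replaced by a plain prefix filter.

```python
class Declaration:
    def __init__(self, name, members=()):
        self.name = name
        self._members = list(members)

    def get_members(self, operator):
        # Only "." is modelled in this sketch; other operators yield nothing.
        return self._members if operator == "." else []


def complete_expression(expression, visible):
    """Resolve e.g. 'aa.bb.c' against the visible declarations and
    return the candidate names for the last, partially typed part."""
    *qualifiers, partial = expression.split(".")
    candidates = visible
    for name in qualifiers:
        # Pick the declaration with this exact name, then apply the operator.
        owner = next((d for d in candidates if d.name == name), None)
        if owner is None:
            return []
        candidates = owner.get_members(".")
    # Everything that begins with the typed text should be offered.
    return [d.name for d in candidates if d.name.startswith(partial)]


bb = Declaration("bb", [Declaration("cc"), Declaration("ccount"), Declaration("dd")])
aa = Declaration("aa", [bb])
scope = [aa, Declaration("ab")]  # what the context would report as visible

print(complete_expression("aa.bb.cc", scope))  # ['cc', 'ccount']
```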

Here are some additional notes for the IntelliSense backend:

  • I make extensive use of LINQ's lazy evaluation mechanisms in implementing GetMembers. Each object in my cache is able to provide a functor that evaluates to its members, so performing complicated actions with the tree is near trivial.
  • Instead of each object keeping a List<IDeclaration> of its members, I keep a List<Name>, where Name is a struct containing the hash of a specially-formatted string describing the member. There's an enormous cache that maps names to objects. This way, when I re-parse a file, I can remove all items declared in the file from the cache and repopulate it with the updated members. Due to the way the functors are configured, all expressions immediately evaluate to the new items.
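A rough Python illustration of the name-keyed cache described in these notes is shown below. It is a sketch under assumptions: plain strings such as "A.x" stand in for the hashed Name struct, a generator stands in for the LINQ-based functor, and `DeclarationCache`, `replace_file` and `ClassDecl` are hypothetical names.

```python
class DeclarationCache:
    def __init__(self):
        self._by_name = {}  # name key -> declaration object
        self._by_file = {}  # file path -> set of name keys declared in it

    def replace_file(self, path, declarations):
        # Drop everything the file declared before, then add the new items.
        for key in self._by_file.pop(path, set()):
            self._by_name.pop(key, None)
        self._by_file[path] = set(declarations)
        self._by_name.update(declarations)

    def lookup(self, key):
        return self._by_name.get(key)


class ClassDecl:
    def __init__(self, name, member_keys, cache):
        self.name = name
        self.member_keys = member_keys  # names only, never object references
        self._cache = cache

    def members(self):
        # Lazy: keys are resolved when iterated, so results are never stale.
        for key in self.member_keys:
            decl = self._cache.lookup(key)
            if decl is not None:
                yield decl


cache = DeclarationCache()
cache.replace_file("A.uc", {"A.x": "int x (v1)", "A.foo": "void foo() (v1)"})
a = ClassDecl("A", ["A.x", "A.foo"], cache)
first = list(a.members())

# Re-parsing the file swaps its declarations; `a` itself is untouched.
cache.replace_file("A.uc", {"A.x": "float x (v2)"})
second = list(a.members())
print(first, second)
```

Because `a` holds only names, the second call to `members()` returns the updated declaration without `a` having been rebuilt.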

IntelliSense "frontend"

As the user types, the file is syntactically incorrect more often than it is correct. As such, I don't want to haphazardly remove sections of the cache when the user types. I have a large number of special-case rules in place to handle incremental updates as quickly as possible. The incremental cache is only kept local to an open file and helps ensure the user doesn't realize that their typing is causing the backend cache to hold incorrect line/column information for things like each method in the file.

  • One redeeming factor is that my parser is fast. It can handle a full cache update of a 20,000-line source file in 150 ms while operating self-contained on a low-priority background thread. Whenever this parser completes a pass on an open file successfully (syntactically), the current state of the file is moved into the global cache.
  • If the file is not syntactically correct, I use an ANTLR filter parser (sorry about the link - most info is on the mailing list or gathered from reading the source) to reparse the file looking for:
    • Variable/field declarations.
    • The signature for class/struct definitions.
    • The signature for method definitions.
  • In the local cache, class/struct/method definitions begin at the signature and end when the brace nesting returns to the level at which the definition began. Methods can also end if another method declaration is reached (no nesting methods).
  • In the local cache, variables/fields are linked to the immediately preceding unclosed element. See the brief code snippet below for an example of why this is important.
  • Also, as the user types, I keep a remap table marking the added/removed character ranges. This is used for:
    • Making sure I can identify the correct context of the cursor, since a method can/does move in the file between full parses.
    • Making sure Go To Declaration/Definition/Reference locates items correctly in open files.
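The remap table in the last bullet could look something like the following Python sketch. The real data structure is not described in this answer, so `RemapTable` and its method names are assumptions. Each edit is recorded in the coordinates the document had when the edit was made, and an offset from the last full parse is translated by replaying the edits in order.

```python
class RemapTable:
    def __init__(self):
        self._edits = []  # (offset at the time of the edit, length delta)

    def record_insert(self, offset, length):
        self._edits.append((offset, length))

    def record_delete(self, offset, length):
        self._edits.append((offset, -length))

    def to_current(self, parsed_offset):
        """Map an offset from the last full parse to the current text."""
        pos = parsed_offset
        for offset, delta in self._edits:
            if delta >= 0:
                # Text inserted at or before pos pushes it to the right.
                if pos >= offset:
                    pos += delta
            else:
                removed = -delta
                if pos >= offset + removed:
                    pos -= removed   # deletion entirely before pos
                elif pos > offset:
                    pos = offset     # pos was inside the deleted range
        return pos


table = RemapTable()
table.record_insert(10, 5)   # user typed 5 characters at offset 10
table.record_delete(0, 3)    # then deleted the first 3 characters

# A method that the last full parse placed at offset 20 is now at 22.
print(table.to_current(20))  # 22
```

Replaying every edit is linear in the number of edits since the last full parse. Since the table can be cleared whenever a full parse succeeds, that list would normally stay short.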

Code snippet showing how declarations are linked in the local cache:

class A
{
    int x; // linked to A

    void foo() // linked to A
    {
        int local; // linked to foo()

    // foo() ends here because bar() is starting
    void bar() // linked to A
    {
        int local2; // linked to bar()
    }

    int y; // linked again to A
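To show how those linking rules behave on the snippet above, here is a small Python sketch. It is not the ANTLR filter parser mentioned earlier: it is a crude line-based scanner with hypothetical regular expressions, and it only handles the simple declaration shapes that appear in the snippet.

```python
import re

CLASS_RE = re.compile(r"^\s*(?:class|struct)\s+(\w+)")
METHOD_RE = re.compile(r"^\s*\w+\s+(\w+)\s*\(")
VAR_RE = re.compile(r"^\s*\w+\s+(\w+)\s*;")

SNIPPET = """\
class A
{
    int x;
    void foo()
    {
        int local;
    void bar()
    {
        int local2;
    }
    int y;
"""


def link_declarations(source):
    """Return (declaration, owner) pairs; owner is None at file scope."""
    links = []
    open_elements = []  # stack of [name, kind, brace depth at the signature]
    depth = 0
    for line in source.splitlines():
        code = line.split("//")[0]  # ignore line comments
        is_statement = code.rstrip().endswith(";")
        cls = CLASS_RE.match(code)
        method = None if is_statement else METHOD_RE.match(code)
        var = VAR_RE.match(code)
        if method and open_elements and open_elements[-1][1] == "method":
            # No nested methods: a new signature ends the unclosed method,
            # and the brace depth is reset to where that method started.
            _, _, depth = open_elements.pop()
        owner = open_elements[-1][0] if open_elements else None
        if cls:
            links.append((cls.group(1), owner))
            open_elements.append([cls.group(1), "class", depth])
        elif method:
            name = method.group(1) + "()"
            links.append((name, owner))
            open_elements.append([name, "method", depth])
        elif var:
            links.append((var.group(1), owner))
        for ch in code:
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                # Back at the depth of the signature: the element is closed.
                if open_elements and depth <= open_elements[-1][2]:
                    open_elements.pop()
    return links


for name, owner in link_declarations(SNIPPET):
    print(name, "->", owner)
```

Even though foo() is never closed, y is still attributed to A, because the start of bar() both ends foo() and resets the brace depth.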

I figured I'd add a list of the IntelliSense features I've implemented with this layout. Pictures of each are located here.

  • Auto-complete
  • Tool tips
  • Method Tips
  • Class View
  • Code Definition Window
  • Call Browser (VS 2010 finally adds this to C#)
  • Semantically correct Find All References

The following link may help you further.

Syntax Highlighting: Fast Colored TextBox for Syntax Highlighting