views:

712

answers:

4

I'm a big fan of documenting the proper behavior of IDE features that have a subtle but significant impact on coding flow - things like auto-completion selection and commenting/uncommenting code you might not realize you take advantage of but at the end of the day you got just a bit more done than you might have. I do so in hopes that other language services I have to use incorporate the feature(s), subsequently improving my daily coding life. "Real" Smart Indent, i.e. the Visual Studio 2008 C# editor, is one of those features.

Basic block code indentation is reasonably straightforward and can be hacked together in a reasonable amount of time well enough to get the job done. True Smart Indent, on the other hand, is quite possibly the most technically challenging task I've had to implement in the IDE to date, and I've implemented my fair share. Even full-blown on-the-fly automatic code reformatting is easier; it just defers to Smart Indent for the heavy lifting.

I'm looking for high-level discussions of general purpose Smart Indent algorithms. In particular, I'm looking for either research on smart indent strategies, or an objective description of all normal and "edge" cases that could be tested to ensure repeatable, bug-free results. Eventually, I'd like to provide both a detailed workflow of the functionality, a concrete foundation for actually implementing the feature, and finally assembling a language-specific version from that and integrating it into my language services.

PS: Visual Studio 2010's C# editor has several small bugs in this feature. Having implemented it myself, I have a whole new respect for the work it takes to polish it.

Edit (8/25): I managed to write down a draft the rules for how I think things should be handled when the smart indent is inside a code comment. I'll probably be working from a C++/C# perspective on the rules, but later they should be able to be parameterized for aspects of other languages.

+1  A: 

Maybe I'm missing something, but the "smart indentation" would be entirely tied up in the grammar specification of the language. The closest thing to an academic paper I could find after a bit of google-fu was, in fact, another SO question pertaining to a particular language, here.

So, I'm afraid I can't technically provide an answer, as I did not find any academic papers, but as a sort of meta-answer (sadly, in the form of a question): is it any harder than parsing the language? I use the term "harder" in a vague computability/complexity sense, not referring to the actual time/effort/tears a person would actually put in.

Consider: indentation level changes, in my experience, within certain sub-clauses. If statements, loops, classes, structs, etc. etc. All of these are already detected by the parser. Just as one can decorate a parse tree to build a semantic tree (here's a shard of a random university website), can't you instead decorate the parse tree with "indent information"?

I guess I just don't see what the call for academic papers is all about. Unless if, of course, there's something I'm missing. Which is quite possible, as I've certainly never dared attempt this. :) But, from my vantage point, it would seem that this smart indenting is possible simply by running a modified parser, and instead of reporting "parse errors", it automatically reformats the code so that it is valid (assuming that the "real" parser already okays the block). Real-time running would certainly cause issues, and there are ambigous levels of indentation in whitespace-dependent language (as the indent level is the end of the block).

As a final (honestly, I'm almost done! :)) note: the Emacs text editory is shockingly good, in my experience. I have no idea how it works, but if I were to try this, that would be the first place I'd look... after SO, of course. :))

Agor
I changed up the question ever-so-slightly (or a lot). I'm having an enormous time creating a testing procedure to prevent regressions while I fix pesky bugs. It's significantly harder than parsing because 1) speed matters *big time* and 2) the document is almost never syntactically correct at the time Smart Indent is invoked.
280Z28
+1 for "document is almost never syntactically correct". That indeed makes it harder. You can still do well by parsing with error correction; the least cost repair tells you what should have been there, and then you can reduce the problem to prettyprinting a clean tree.
Ira Baxter
+3  A: 

Emacs CC Mode manual: Indentation Engine Basics.

Steve Yegge blog rant: js2-mode: a new JavaScript mode for Emacs.

Quote from the latter: "Amazingly, surprisingly, counterintuitively, the indentation problem is almost totally orthogonal to parsing and syntax validation."

Marius Andersen
Steve's blog makes complete sense. I totally understand where he's coming from. One thing to note: his blog post is missing a ton of cases that need to be considered, but that's most likely because there are too many to list.
280Z28
Well I'll be. My only consolation is that Steve too believes that it was surprising to learn that the problem is orthagonal to parsing. :) +1
Agor
My personal opinion, based on building real tools, is that "smart indentation" requires you parse the code, and then prettyprint. See other ansswers.
Ira Baxter
+2  A: 

The magic search phrase you are looking for might be "pretty print".

Steven Huwig
Amen. +1 (Apparantly I need to type 15 characters here, just to say I gave you a point).
Ira Baxter
+2  A: 

Like another responder, the key idea for doing this right is prettyprinting, that is, generating text from the abstract syntax structure of the code.

Basically you take advantage of the nesting of the tree to produce nesting of the printed text. Key ideas are the notion of building primitive strings from leaves of the tree, gluing horizontal boxes [rectangles of text] together from other boxes from subttrees to provide horizontal composition, and gluing boxes on top of one another to get bigger vertical boxes.

Tricky parts: regenerating the langauge literals with formatting information from the tree leaves (just how many leading zeros did that binary float point number have?), handling right margin overflow by allowing alternative box layouts and backtracking, and pattern matching complex tree structures to prettyprint particular trees in nice ways (e.g., nested if-then-if-then-if....)

Here's a research paper on the topic (Full text PDF).

Here's what we did for prettyprinting with the DMS Software Reengineering Toolkit to prettyprint ASTs generated by large-scale metaprogramming.

Ira Baxter
+1 That paper will definitely be helpful for documenting a generalized Smart Indent behavior. I do believe you are vastly underestimating the difficulty of relating Smart Indent to *any* normal parsing algorithm.
280Z28