+1  A: 

The best way is probably to reuse something existing, such as ScintillaNET.

M4N
Thanks. I've looked at that as well as at several others, but I'm more interested in knowing exactly how it is implemented in the more commercial IDEs.
Pessimist
+1  A: 

I don't think there is a "this is the best way and any other way is less efficient" way to do it. In reality I don't think that efficiency is the major problem. Rather, complexity is. A good syntax highlighter is based on a good parser. As long as you can parse the code, you can highlight every part of it in any way you like. But what happens when the code is not well-formed? A lot of syntax highlighters just highlight keywords and a few block structures to get around this problem. By doing this, they can use simple regular expressions instead of a full-fledged, syntax-error-tolerant parser (which is what Visual Studio has).
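To make the regex approach concrete, here is a minimal sketch in Python. The token classes and the tiny keyword list are made up for illustration and are not taken from any real IDE. It only finds comments, string literals and keywords, and simply skips over anything it does not recognise, which is why it keeps working on code that is not well-formed.

```python
import re

# Hypothetical token classes for a tiny C-like language (illustration only).
# Order matters: comments and strings come first so that keywords inside
# them are not highlighted separately.
TOKEN_SPEC = [
    ("comment", r"//[^\n]*"),
    ("string", r'"(?:\\.|[^"\\\n])*"'),
    ("keyword", r"\b(?:if|else|while|return|int)\b"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def highlight_spans(source):
    """Return (start, end, kind) for every region that should be colored."""
    return [(m.start(), m.end(), m.lastgroup) for m in TOKEN_RE.finditer(source)]
```

An editor would map each `kind` to a color and repaint only those ranges. An unterminated string literal just fails to match, and the keywords after it are still found.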

klausbyskov
Yes, but see, wouldn't regexes be a lot less efficient than whatever way Microsoft is doing it? So you're saying that for Microsoft it's very easy, since they have the parser for the language anyway (for compiling), and thus they can just use that for the syntax highlighting and get it 100% right automatically? If so, can you point to code that implements it that way?
Pessimist
@Pessimist: You may want to look at this: http://bit.ly/3w5wK3 and this: http://bit.ly/dxDrkx
klausbyskov
@Pessimist: but please note that the compiler alone cannot be used when the highlighted code is not well-formed. Furthermore, as you have probably read on Wikipedia, the regex approach is not necessarily very efficient.
klausbyskov
Although no code was offered and I'd really like to see some code implementing the suggested solution (but also addressing the bigger picture, as outlined in the original question - things like only highlighting visible code, etc), I understand what is being talked about here and I can relate this to what I already know. I'll choose this as the accepted answer.
Pessimist
@klausbyskov: thanks for the links!
Pessimist
A: 

As with anything in code, there is rarely a "best" way. There are multiple ways of doing things, and each of them has benefits and drawbacks.

That said, some form of the Interpreter Pattern is probably the most common way. According to the GoF book:

The Interpreter pattern is widely used in compilers implemented with object-oriented languages, as the Smalltalk compilers are. SPECTalk uses the pattern to interpret descriptions of input file formats. The QOCA constraint-solving toolkit uses it to evaluate constraints.

It also goes on to talk about its limitations in the applicability section:

  • the grammar is simple. For complex grammars, the class hierarchy for the grammar becomes large and unmanageable. Tools such as parser generators are a better alternative in such cases.
  • efficiency is not a critical concern. The most efficient interpreters are usually not implemented by interpreting parse trees directly but by first translating them into another form. For example, regular expressions are often transformed into state machines. But even then, the translator can be implemented by the Interpreter pattern, so the pattern is still applicable.

Understanding this, you should now see why it's better to pre-compile a reusable RegEx before performing many matches with it. If you don't, it may have to do both steps every time (transformation and interpretation) rather than building the state machine once and applying it efficiently many times over.
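As a rough sketch of the compile-once idea, here is what it looks like in Python (the keyword list and the function name are hypothetical). The pattern is translated a single time at import, and the compiled object is reused for every line.

```python
import re

# Hypothetical keyword set, used only for illustration.
# Compiling here means the pattern is translated once, not once per match.
KEYWORD_RE = re.compile(r"\b(?:if|else|while|return)\b")


def count_keywords(lines):
    """Apply the already-compiled pattern to every line and count the hits."""
    return sum(len(KEYWORD_RE.findall(line)) for line in lines)
```

Note that Python's `re` module also keeps an internal cache of recently compiled patterns, so the gain there is smaller than in runtimes without such a cache. Holding on to the compiled object still makes the intent explicit.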

Specifically for the type of interpretation you are describing, Microsoft exposes the Microsoft.VisualStudio namespace and all of its powerful features as part of the Visual Studio SDK. You can also look at System.CodeDom for dynamic code generation and compilation.

slf