I'm looking for good overviews of best practices and common patterns for enabling syntax highlighting in a textbox. It seems like a very common exercise: almost every language has a UI control that supports syntax highlighting for different languages. I'm curious whether there is a common pattern of implementation.

Is everyone using regular expressions? Is there a repository for regular expressions that are commonly used in syntax highlighting scenarios?

Are there alternative/better approaches to syntax highlighting?

Update

Links to relevant resources about performing syntax highlighting in a given language or concepts related to syntax highlighting would be great. Lexing (lexical analysis) was brought up in an answer but without a link to learn more. Anything to help better understand this commonly solved problem would be great.

Lexical Analysis on Wikipedia

+2  A: 

Regular expressions are definitely where most people start. However, they can't really cope with the many edge cases found in most languages: text that looks like a keyword can appear inside a string literal, and string literals in turn can contain escaped delimiters as well as special characters. The same goes for comments, etc.
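To make the keyword-inside-a-string problem concrete, here is a minimal Python sketch. The keyword list and the function name `naive_keyword_spans` are made up for illustration and are not taken from any real highlighter.

```python
import re

# Hypothetical keyword set for an imaginary C-like language.
KEYWORD = re.compile(r"\b(?:if|else|return)\b")

def naive_keyword_spans(source):
    """Return (start, end) for every keyword-looking word, ignoring context."""
    return [m.span() for m in KEYWORD.finditer(source)]

# Only the leading `return` is really a keyword; `if` and `else`
# sit inside a string literal that also contains escaped quotes.
source = 'return "if \\"x\\" else y"'
words = [source[s:e] for s, e in naive_keyword_spans(source)]
# words == ['return', 'if', 'else']
```

The pattern has no idea it is inside a string, so all three words would be styled as keywords.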

Basically, to do a good job of syntax highlighting you need to lex the source: scan it, applying language-specific heuristics, to build a list of regions, where each region of the source is annotated with how it is to be styled.
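As a rough sketch of what "a list of regions" might look like, here is a small hand-written scanner in Python. The language rules (double-quoted strings with backslash escapes, `//` line comments, three keywords) and the style names are assumptions chosen for the example, not any particular editor's implementation.

```python
KEYWORDS = {"if", "else", "return"}  # hypothetical keyword set

def lex(source):
    """Split source into (start, end, style) regions covering every character."""
    regions = []
    i, n = 0, len(source)
    while i < n:
        start = i
        ch = source[i]
        if ch == '"':
            # String literal: skip escaped characters so \" does not end it.
            i += 1
            while i < n and source[i] != '"':
                i += 2 if source[i] == "\\" else 1
            i = min(i + 1, n)  # consume the closing quote if there is one
            style = "string"
        elif source.startswith("//", i):
            # Line comment: runs to the end of the line.
            while i < n and source[i] != "\n":
                i += 1
            style = "comment"
        elif ch.isalpha() or ch == "_":
            # Identifier or keyword.
            while i < n and (source[i].isalnum() or source[i] == "_"):
                i += 1
            style = "keyword" if source[start:i] in KEYWORDS else "code"
        else:
            i += 1
            style = "code"
        regions.append((start, i, style))
    return regions

source = 'return "if \\"x\\" else y" // if'
styles = [style for _, _, style in lex(source)]
# styles == ['keyword', 'code', 'string', 'code', 'comment']
```

Because the scanner knows when it is inside a string or comment, the `if` and `else` in the literal and the `if` in the comment are never considered as keywords.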

As edits take place, you can again apply the language rules to see how far a change can alter the presentation of a region. For example, typing a letter inside a string literal simply makes the string literal region longer, but typing a closing quote truncates the region and turns the leftover part of it into code, subject to all the other lexing rules.
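The two edits described above can be illustrated with a deliberately tiny Python sketch. It assumes a made-up language with only two styles (double-quoted strings with backslash escapes, and everything else as code), and it simply re-lexes the whole text after each edit; a real editor would more likely restart lexing near the edited region. The helper names are hypothetical.

```python
import re

# A string (possibly unterminated), or a run of non-quote characters.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"?|[^"]+')

def lex(source):
    """Return (start, end, style) regions for the two-style toy language."""
    return [
        (m.start(), m.end(), "string" if m.group().startswith('"') else "code")
        for m in TOKEN.finditer(source)
    ]

def insert_text(source, pos, text):
    """Simulate the user typing `text` at offset `pos`."""
    return source[:pos] + text + source[pos:]

before = 'x = "hello world" + y'
# lex(before) == [(0, 4, 'code'), (4, 17, 'string'), (17, 21, 'code')]

# Typing an ordinary character inside the string just lengthens that region.
typed_char = insert_text(before, 10, "!")
# lex(typed_char) == [(0, 4, 'code'), (4, 18, 'string'), (18, 22, 'code')]

# Typing a quote ends the string early; the leftover ` world` becomes code,
# and the old closing quote now opens a new, unterminated string.
typed_quote = insert_text(before, 10, '"')
# lex(typed_quote) == [(0, 4, 'code'), (4, 11, 'string'),
#                      (11, 17, 'code'), (17, 22, 'string')]
```

Note how the quote edit changes the styling of everything after it, which is why an editor has to work out how far an edit's effect reaches rather than restyling only the edited character.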

levik
So I would guess most web-based (JavaScript) highlighters are using regular expressions and actual IDEs are lexing?
spoon16
Probably, though even with JS the good editors will likely lex. The regex ones, well, they get confused at times. I know I've seen this happen in some editors, where they treat an escaped quote as if it were a string delimiter.
levik