I was hoping to write my own syntax highlighter for a summer project I am thinking of working on, but I am not sure how to go about it.

I know that there are a bunch of implementations out there, but I would like to learn about regular expressions and how syntax highlighting works.

How does syntax highlighting work and what are some good references for developing one? Does the syntax highlighter scan each character as it is typed or does it scan the document/text area as a whole after each character is typed?

Any insight would be greatly appreciated.

Thanks.

PS: I was planning on writing it in ActionScript.

+1  A: 

You should treat the entire document as a whole at first. I think (without being an expert) you want to break down every token and make a parse tree.

Then, once you have all that set up, you could at first run the parser every time a new character is typed. That might be good enough for your use case, but if you want to keep things fast, you'll need to update the parse tree incrementally as you get more information.
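As a rough illustration of the "re-run everything on each keystroke" approach, here is a minimal sketch in Python (not ActionScript). The token kinds, the keyword list, and the `on_text_changed` hook are all made up for the example; a real editor would map the returned character ranges to colors in its text component.

```python
import re

# Hypothetical token kinds for a tiny Java-like language; not a full grammar.
TOKEN_SPEC = [
    ("comment", r"//[^\n]*"),
    ("string",  r'"(?:\\.|[^"\\\n])*"'),
    ("number",  r"\d+(?:\.\d+)?"),
    ("word",    r"[A-Za-z_]\w*"),
    ("space",   r"\s+"),
    ("other",   r"."),
]
# Assumed keyword subset, for illustration only.
KEYWORDS = {"class", "int", "return", "if", "else", "void", "public"}

# One combined pattern; the name of the matching group tells us the token kind.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return (kind, start, end) triples covering the whole text."""
    tokens = []
    for match in MASTER.finditer(text):
        kind = match.lastgroup
        if kind == "word" and match.group() in KEYWORDS:
            kind = "keyword"
        tokens.append((kind, match.start(), match.end()))
    return tokens

def on_text_changed(text):
    """Naive strategy: re-lex the whole document after every edit."""
    return [token for token in tokenize(text) if token[0] != "space"]
```

Calling `on_text_changed` with the full editor contents after each keystroke is the simple version; only if that turns out to be too slow would you need to limit the re-lexing to the part of the text that changed.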

Evert
What you're basically building is called a lexer (I believe).
Evert
I agree that the document should be treated as a whole at first, but it is hard to say whether this meets the specs, given the vague description of the project.
Tony
+1  A: 

It might help if you explain what this syntax highlighter is for. If you are writing it in ActionScript, is your idea to have a text box in a Flash movie and highlight the syntax after a submit button is pushed? Or do you want to read the text from some web service and then display the highlighted syntax? It's hard for me to help, because it is hard for me to imagine what you are doing.

However, a syntax highlighter reads in text, then compares the lines of code against some regexes, which help it figure out what the words mean. For example, it might recognize the words "function" or "int" as reserved words and replace them with the HTML text:

<span class="reserved">function</span>, <span class="reserved">int</span>

assuming you have the CSS and want reserved words in red:

.reserved{
  color: #ff0000;
}

This is the basic concept, and you may want to take ideas from GeSHi, since you can view its source.
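The replace-with-a-span idea can be sketched in a few lines. This is only an illustration in Python, not how GeSHi does it; the `RESERVED` list and the `highlight_reserved` name are made up for the example, and the `reserved` class matches the CSS rule above.

```python
import html
import re

# Example list taken from the answer; a real highlighter would use the
# full keyword list of the target language.
RESERVED = ("function", "int")

# \b keeps "int" from matching inside longer words such as "printf".
RESERVED_RE = re.compile(r"\b(" + "|".join(RESERVED) + r")\b")

def highlight_reserved(code):
    """Escape the code for HTML, then wrap reserved words in a span."""
    escaped = html.escape(code, quote=False)
    return RESERVED_RE.sub(r'<span class="reserved">\1</span>', escaped)
```

Note that a plain search-and-replace like this will also color reserved words that appear inside strings and comments, which is one reason more complete highlighters tokenize the text first.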

Tony
Sorry I didn't clarify more. I am planning on implementing a collaborative text editor in Adobe Flex. What I want is a TextArea or a similar text input component where, as the user types (for instance, Java code), the code being typed becomes syntax highlighted, much like in any IDE with syntax highlighting.
tkeE2036
A: 

In Stack Overflow podcast number 50, Steve Yegge talks a little about his project for creating a general highlighting mechanism. It is not a finished product and may be more sophisticated than what you are looking for, but there could be something of interest.

hlovdal
+2  A: 

Syntax highlighters can work in two very general ways. Either they implement a full lexer and parser for the language(s) they are highlighting and exactly identify each token's type (keyword, class name, instance name, variable type, preprocessor directive...). From there, they have all the information they need to highlight the code exactly according to some specification (keywords in red, class names in blue, what have you).

You may also want to look at Google Code Prettify, which instead of implementing one lexer/parser per language, has a couple of very general parsers that can do a decent job on most syntaxes. This highlighter, for example, will be able to parse and highlight reasonably well any C-like language, because its lexer/parser can identify the general components of those kinds of languages.

It also has the advantage that, as a result, you don't need to explicitly specify the language, as the engine will determine by itself which of its generic parsers can do the best job. The downside, of course, is that the highlighting is less accurate than when a language-specific parser is used.
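To give a feel for what a "generic" lexer for C-like languages might look like, here is a small Python sketch. It is not Prettify's actual implementation; the pattern, the tiny shared keyword list, and the "capitalized word is probably a type" heuristic are all assumptions made for the example.

```python
import re

# One pattern meant to be "good enough" for many C-like languages.
# Alternatives are tried in order, so comments and strings win over keywords.
GENERIC_C_LIKE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)
  | (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<number>\b\d+(?:\.\d+)?\b)
  | (?P<type>\b[A-Z]\w*\b)
  | (?P<keyword>\b(?:if|else|for|while|return|class|void|int)\b)
""", re.VERBOSE | re.DOTALL)

def classify(code):
    """Return (kind, text) pairs for the spans the generic pattern recognizes."""
    return [(match.lastgroup, match.group()) for match in GENERIC_C_LIKE.finditer(code)]
```

Anything the pattern does not recognize (plain identifiers, operators, whitespace) is simply skipped and left uncolored. The heuristics will sometimes guess wrong, which is the accuracy trade-off described above.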

David Anderson
You started to say that highlighters worked in two general ways but then unless I misunderstood, you didn't explain the second way.
Marplesoft
+1  A: 

Unfortunately, I have never used ActionScript, so I cannot help with that part.

But apart from that, a good start to writing a syntax highlighter would be to look at existing ones. For example, vim has syntax files in the form of ordinary text files, so you could look at those for a start. There are a bunch of regular expressions in there (regular expressions come in several flavours, but they're not so different ...), so for that part you might take a glance at some book.

Personally, I've found Beginning Regular Expressions to be a nice one. Mastering Regular Expressions is also nice for more advanced subjects. Regular Expression Pocket Reference, on the other hand, is nice for working out the differences between the above-mentioned flavours, since it includes a chapter on vim's regex as well.

ldigas
A: 

Hi, I posted an SQL code coloring tool on my blog a while ago: http://gruchalski.com/2009/04/26/flex-textrange-performance-issue-on-linux/

You can find a link to sqlcodecoloring.zip with the source there. It is implemented using a tokenizer and the TextRange class.

Another link, sql code coloring as part of the prototype app: http://github.com/radekg/mysqlinterface/tree/master

radekg