I was wondering how Stack Overflow parses so many different kinds of code and identifies keywords, special characters, whitespace formatting, and so on. It seems to do this for most languages, and I've noticed it is even sophisticated enough to understand how the pieces it parses relate to one another, like so:

String mystring1 = "inquotes"; //incomment
String mystring2 = "inquotes//incomment";
String mystring3 = //incomment"inquotes";

Many IDEs do this also. How is this done?

Edit: Further explanation: I am not asking about parsing the text itself. My question is, once I am past that part, is there something like a universal XML Schema or cross-language format hierarchy that describes which strings are keywords, which characters denote comments, text strings, logical operators, and so on? Or must I become a syntax guru for every language I wish to parse accurately?

+2  A: 

To highlight a language correctly, you have to build a parse tree. This requires first tokenizing the string and then performing either a top-down or a bottom-up parse. Afterwards, a separate pass walks the tree and highlights the portions of the original string that correspond to nodes of a given kind.

To really understand this, you're going to have to read a book on compiler design/programming language fundamentals. The relevant topics are tokenizers, parsing, and grammars.
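As a rough illustration of the tokenize, parse, and walk steps, here is a minimal Python sketch for a made-up toy language (identifiers, integers, "+", "-", and parentheses). The names Node, Parser, and highlight are invented for this example, and a real highlighter would need a far larger grammar.

```python
import re
from dataclasses import dataclass, field

# Toy language (an assumption of this sketch): identifiers, integers, + - ( )
TOKEN_RE = re.compile(r"\s*(?:(?P<num>\d+)|(?P<ident>[A-Za-z_]\w*)|(?P<op>[-+()]))")

@dataclass
class Node:
    kind: str            # "num", "ident", "op", "binop" or "paren"
    start: int           # offsets into the original source string
    end: int
    children: list = field(default_factory=list)

def tokenize(src):
    """Step 1: turn the string into (kind, start, end) tokens."""
    tokens, pos = [], 0
    while pos < len(src) and not src[pos:].isspace():
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character near offset {pos}")
        kind = m.lastgroup
        tokens.append((kind, m.start(kind), m.end(kind)))
        pos = m.end()
    return tokens

class Parser:
    """Step 2: top-down (recursive descent) parse into a tree of Nodes."""
    def __init__(self, src):
        self.src, self.tokens, self.i = src, tokenize(src), 0

    def at_op(self, chars):
        if self.i >= len(self.tokens):
            return False
        kind, start, _ = self.tokens[self.i]
        return kind == "op" and self.src[start] in chars

    def expr(self):                      # expr := term (("+" | "-") term)*
        left = self.term()
        while self.at_op("+-"):
            _, start, end = self.tokens[self.i]
            self.i += 1
            right = self.term()
            left = Node("binop", left.start, right.end,
                        [left, Node("op", start, end), right])
        return left

    def term(self):                      # term := NUM | IDENT | "(" expr ")"
        if self.i >= len(self.tokens):
            raise SyntaxError("unexpected end of input")
        kind, start, end = self.tokens[self.i]
        if self.at_op("("):
            self.i += 1
            inner = self.expr()
            if not self.at_op(")"):
                raise SyntaxError("expected ')'")
            _, cstart, cend = self.tokens[self.i]
            self.i += 1
            return Node("paren", start, cend,
                        [Node("op", start, end), inner, Node("op", cstart, cend)])
        if kind == "op":
            raise SyntaxError(f"unexpected operator at offset {start}")
        self.i += 1
        return Node(kind, start, end)

def highlight(src):
    """Step 3: walk the tree and report (kind, text) for every leaf."""
    parser = Parser(src)
    tree = parser.expr()
    if parser.i != len(parser.tokens):
        raise SyntaxError("trailing input")
    spans = []
    def walk(node):
        if node.children:
            for child in node.children:
                walk(child)
        else:
            spans.append((node.kind, src[node.start:node.end]))
    walk(tree)
    return spans

print(highlight("x + (42 - y)"))
```

Each leaf keeps its offsets into the original string, so a front end could wrap those ranges in whatever colour markup it likes.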

Borealid
Any college course on compilers should be a good start.
Jarrett Meyer
"Any college course on X" is a good start for a *lot* of questions here, but it's not typically a very helpful answer since the askers aren't often in a position to take such a course. If they were, they could just go ask the professor instead of hoping us random geeks on the internet will feel like answering them.
Ken
Haha, well said, Ken. I would love to have the opportunity to take a college course on compilers, but that is not possible in my situation. @Borealid: I am familiar with parse trees. My question is, once I am past that part, is there something like a universal XML Schema or code-structuring hierarchy that describes which strings are keywords, which characters denote comments, text strings, logical operators, and so on? Or do I have to become a syntax master for every language I wish to parse accurately?
stupidkid
@stupidkid: Tokenization and parsing deal with a language's *syntax*: the tokenizer splits the text into tokens, and the parser applies the grammar to them. What the resulting constructs mean is the language's *semantics*. XML represents a universal syntax. There will never be, and cannot be, universal semantics, because semantics is meaning. What counts as a "logic operator" depends on the language. So yes, you have to build a different parser for each language you want to understand. Take a look, however, at "parser generators" like Bison. You feed them an abstract description of the language's grammar, and they spit out C source for a parser.
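To give a feel for the "grammar as data" idea, here is a small Python sketch in which the grammar is a plain dictionary and one generic function interprets it. This is not how Bison works (Bison reads its own grammar file format and generates a table-driven bottom-up parser as C source). The GRAMMAR layout and the parse and recognizes helpers are invented for this illustration.

```python
# Hypothetical grammar description: each rule maps to a list of alternatives,
# and each alternative is a sequence of rule names or literal tokens.
GRAMMAR = {
    "expr": [["term", "+", "expr"], ["term"]],
    "term": [["NUM"], ["(", "expr", ")"]],
}

def parse(symbol, tokens, pos=0):
    """Try to match `symbol` at tokens[pos:]; return (tree, new_pos) or None."""
    if symbol not in GRAMMAR:
        # Terminal: either the special NUM class or a literal token.
        if pos < len(tokens):
            tok = tokens[pos]
            if tok == symbol or (symbol == "NUM" and tok.isdigit()):
                return tok, pos + 1
        return None
    # Non-terminal: try each alternative in order, backtracking on failure.
    for alternative in GRAMMAR[symbol]:
        children, cur = [], pos
        for part in alternative:
            result = parse(part, tokens, cur)
            if result is None:
                break
            child, cur = result
            children.append(child)
        else:
            return (symbol, children), cur
    return None

def recognizes(tokens):
    """True if the whole token list is a valid `expr`."""
    result = parse("expr", tokens)
    return result is not None and result[1] == len(tokens)

print(recognizes(["1", "+", "(", "2", "+", "3", ")"]))
print(recognizes(["1", "+"]))
```

Supporting another language here means swapping in a different GRAMMAR, but someone still has to write that description for each language, which is the point made above.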
Borealid
+3  A: 

To really have your IDE/compiler/interpreter "understand" and colorize code, you'll need to parse it and pull out the different syntactic parts. The classic reference for this is the Dragon Book, "Compilers: Principles, Techniques, and Tools." You can see some of the difficulty in constructs like this:

i+++++i; 

or

list<list<hash<list<int>,hash<int,list<int>>>>>;
//or just matching parens 
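To make the difficulty concrete, here is a minimal Python sketch of a "longest match" tokenizer, the rule C-family lexers use. It shows why `i+++++i` splits into `i ++ ++ + i` rather than the `i++ + ++i` a human might expect, and why the closing brackets of a nested template come out as a shift operator `>>` unless the parser gets involved. The token set is a small assumption made for this example, not a full C or C++ lexer.

```python
import re

# Longer operators are listed before their one-character prefixes, so the
# regex alternation always prefers the longest operator that matches.
TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|\d+|\+\+|>>|[-+<>,;()])")

def tokenize(src):
    """Split src into tokens, always taking the longest operator available."""
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected character near offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

print(tokenize("i+++++i;"))          # ['i', '++', '++', '+', 'i', ';']
print(tokenize("list<list<int>>;"))  # ['list', '<', 'list', '<', 'int', '>>', ';']
```

A highlighter that only colours tokens can live with this. Anything that needs the real structure has to resolve such cases with knowledge of the grammar.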

Doing this properly is a hard problem. Some languages, like Java, make it easier than others, such as C and C++ (which both have standards) or Ruby (which has long relied on its reference implementation as the de facto spec). However, if you only want a few bits of highlighting, you can skip large parts of the grammar and get an 80% solution much more easily. I suspect that the SO engine knows about strings and a few different types of comments, and that this does well enough for its purpose.
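As a guess at what such an 80% solution might look like (this is not the actual SO engine), here is a short Python scanner that knows only about double-quoted strings and `//` line comments. The function name segments and the three category labels are made up for this sketch. It is run against the three lines from the question.

```python
def segments(line):
    """Split one line into ("code" | "string" | "comment", text) pieces."""
    out = []
    i = start = 0          # start marks where the current run of plain code began
    n = len(line)
    while i < n:
        if line.startswith("//", i):
            # A comment outside a string runs to the end of the line.
            if i > start:
                out.append(("code", line[start:i]))
            out.append(("comment", line[i:]))
            return out
        if line[i] == '"':
            if i > start:
                out.append(("code", line[start:i]))
            # Scan to the closing quote, skipping backslash-escaped characters.
            j = i + 1
            while j < n and line[j] != '"':
                j += 2 if line[j] == "\\" else 1
            j = min(j + 1, n)
            out.append(("string", line[i:j]))
            i = start = j
            continue
        i += 1
    if n > start:
        out.append(("code", line[start:]))
    return out

for src in ['String mystring1 = "inquotes"; //incomment',
            'String mystring2 = "inquotes//incomment";',
            'String mystring3 = //incomment"inquotes";']:
    print(segments(src))
```

Whichever delimiter comes first wins: a `//` inside a string stays part of the string, and a quote inside a comment stays part of the comment. That matches the behaviour described in the question without any knowledge of the rest of the grammar.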

The gap in difficulty between 80% and 100% is one reason that most IDEs have syntax highlighting for C++ but Visual C++ still doesn't have C++ refactoring support. For highlighting, a few mistakes are probably OK. When you're refactoring, you really need to understand variable scope across different namespaces, along with all sorts of pointer details.

Paul Rubel
+1 for invoking the Dragon
Stephen P
+1 from me, too, for providing a direct link to the Dragon Book.
Android Eve