views:

182

answers:

3

Hello.

I'm currently learning how lexers and parsers work, and i have following question about state machine. For example, i need to colorize text by following rule: For this rule simple state transition table will look like this:

current event next  action
IDLE    $     COLOR -
COLOR   any   -     OnColor()
COLOR   \n    IDLE  -

This will call OnColor() action for every character that is between '$' and line end so i can colorize it. Of course same can be automatically generated from regexp, but i really want to know how it works before heavy magic usage :). Next goes problem: if i have a rule: (want to color any line of text that ends with dollar, the state transition table is not very clear:

current      event next             action
IDLE         any   -                -
IDLE         $     DOUND_DOLLAR     -
FOUND_DOLLAR \n    IDLE             OnDollar()
FOUND_DOLLAR any   IDLE             -

I can teach my state machine to call OnDollar() if it founds a '$' sign at end of line, but what i can do in order to colorize text that was BEFORE dollar sign encounter? What are common patterns to solve such problems? Of course it will be 1 line with regexp, but i'm really interested to know how such parser can be implemented via state machine and is it possible at all.

A: 

If you are constrained to color one character at a time (i.e. you have no buffering, lookahead, recoloring or marking capability), then it is impossible.

Otherwise, if you have such capabilities, it can be done; the technique depends on what's available.

  • Recoloring - have an action that can recolor n characters back. Obviously, this is a trivial solution.

  • Buffering / marking - have an action that places character onto end of a buffer / sets a named mark in the source, rather than letting the character through. Then, when you find out later what to do, have an action that commits the buffer one way or another, or flushes from a named mark. Recoloring more than 1 character with this gets somewhat complicated though.

  • Lookahead - have speculative transitions, i.e. use an NFA instead of a DFA.

Barry Kelly
I have no constraits at all, want to know the 'correct' way to solve such task in a real word. Of course i can create a temp buffers and tinker with additional flags / multiple states, but i feel that somehow this falls out of straightforward design of transitional table state machine :(
Eye of Hell
I'm not kidding around, "Eye of Hell" - it is impossible to color the character before the character that triggers the coloring, if you don't either have some buffering or some lookahead.
Barry Kelly
Maybe it is wise to scan text first into lexem and when parse to tokens, analyzing not only lexem ray text but also they position one to another? Or maybe it is well-known trick that allows a transition table state machine accumulate data more smart way? :)
Eye of Hell
A: 

Most colorizers always work on a larger block, say a whole line (which is sufficient in most cases) plus a "leak" flag for, say, multi-line comments. See the Qt Syntax Highlighter example for such an API.

Aaron Digulla
Qt syntax highlighter uses standard regexp grammar that is kind of higher level :). I'm interested to learn how to create correct state machines before creating them automatically via regexp magic :)
Eye of Hell
true but the whole layout of the API, what data it needs and how it chops the input, should give you many ideas how to approach the problem. Use the source, Luke!
Aaron Digulla
A: 

By reading "Purple Dragon Book" (sic) it seems that modern compilers and interpreters actively using "look ahead" buffer and accumulate recent text, so they can easily check few next symbols and few previous symbols in order to get exact lexem type.

So, in my example event() need to look at next and previous symbols in order to decide type of lexem that might be accumulated.

Eye of Hell