I've already written a generator that does the trick, but I'd like to know the best possible way to implement the off-side rule.
In short: in this context, the off-side rule means that indentation is recognized as a syntactic element.
Here is the off-side rule in pseudocode, for building tokenizers that capture indentation in a usable form. I don't want to limit answers to any particular language:
token NEWLINE
    matches r"\n\ *"
    increase the line count
    pick up and store the indentation level
    remember to also record the current parenthesis nesting depth

procedure layout tokens
    level = stack of indentation levels
    push 0 to level
    last_newline = none
    for each token
        if it is NEWLINE, put it in last_newline and get the next token
        if last_newline contains something
            extract new_level and parenthesis_count from last_newline
            - if the newline was inside parentheses, clear last_newline and emit nothing
            - if new_level > level.top
                push new_level to level
                emit last_newline as INDENT token and clear last_newline
            - if new_level == level.top
                emit last_newline and clear last_newline
            - otherwise
                while new_level < level.top
                    pop from level
                    if new_level > level.top
                        freak out, indentation is broken.
                    emit last_newline as DEDENT token
                clear last_newline
        emit token
    while level.top != 0
        emit a DEDENT token
        pop from level
Comments are stripped before the tokens reach the layouter.
The layouter sits between the lexer and the parser.
This layouter never generates more than one NEWLINE at a time, and it doesn't generate a NEWLINE when an INDENT is coming up, so the parsing rules stay quite simple. I think it's pretty good, but let me know if there's a better way of accomplishing this.
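For concreteness, here is a minimal Python sketch of the pseudocode above. It's only an illustration under a few assumptions: the `Token` tuple, the names `lex` and `layout`, and the toy token set (words, parentheses, `#` comments) are made up for this example, line counting is left out, and I emit one DEDENT per popped indentation level.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional


class Token(NamedTuple):
    kind: str        # "NEWLINE", "INDENT", "DEDENT", "LPAREN", "RPAREN" or "WORD"
    text: str
    indent: int = 0  # indentation width; only meaningful for newline-derived tokens
    depth: int = 0   # parenthesis depth at the point the newline was seen


# Toy lexer grammar (an assumption for this sketch, not a real language).
TOKEN_RE = re.compile(
    r"(?P<NEWLINE>\n *)"
    r"|(?P<SKIP>[ \t]+|#[^\n]*)"
    r"|(?P<LPAREN>\()"
    r"|(?P<RPAREN>\))"
    r"|(?P<WORD>[^\s()#]+)"
)


def lex(source: str) -> Iterator[Token]:
    """Yield tokens; NEWLINE tokens carry indentation and paren depth."""
    depth = 0
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":          # whitespace and comments never reach the layouter
            continue
        if kind == "LPAREN":
            depth += 1
        elif kind == "RPAREN":
            depth -= 1
        if kind == "NEWLINE":
            yield Token(kind, "\n", indent=len(text) - 1, depth=depth)
        else:
            yield Token(kind, text)


def layout(tokens: Iterable[Token]) -> Iterator[Token]:
    """Turn raw NEWLINE tokens into INDENT / NEWLINE / DEDENT tokens."""
    levels = [0]
    last_newline: Optional[Token] = None
    for tok in tokens:
        if tok.kind == "NEWLINE":
            last_newline = tok      # a later newline overwrites an earlier one
            continue
        if last_newline is not None:
            nl, last_newline = last_newline, None
            if nl.depth > 0:
                pass                # inside parentheses: indentation is ignored
            elif nl.indent > levels[-1]:
                levels.append(nl.indent)
                yield nl._replace(kind="INDENT")
            elif nl.indent == levels[-1]:
                yield nl
            else:
                while nl.indent < levels[-1]:
                    levels.pop()
                    if nl.indent > levels[-1]:
                        raise SyntaxError("indentation is broken")
                    yield nl._replace(kind="DEDENT")
        yield tok
    while levels[-1] != 0:          # close any blocks still open at end of input
        levels.pop()
        yield Token("DEDENT", "")
```

With this sketch, `[t.kind for t in layout(lex("if x\n    a\n    b\nc\n"))]` gives `WORD WORD INDENT WORD NEWLINE WORD DEDENT WORD`: the NEWLINE before `a` became the INDENT, and the trailing newline is dropped because nothing follows it.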
After using this for a while, I've noticed that it may be nice to emit a NEWLINE after DEDENTs anyway. That way you can separate expressions with NEWLINE while keeping the INDENT ... DEDENT pair as a trailer of the expression.
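One way to read that tweak is as a small pass over the layouter's output that adds a single NEWLINE after each run of DEDENTs, but only when another token follows. The sketch below assumes tokens are plain `(kind, text)` pairs, and the name `newline_after_dedents` is made up for the example.

```python
from typing import Iterable, Iterator, Tuple

Tok = Tuple[str, str]  # (kind, text); a simplified token shape for this sketch


def newline_after_dedents(tokens: Iterable[Tok]) -> Iterator[Tok]:
    """Insert one NEWLINE after each run of DEDENTs that is followed by a token."""
    pending = False                 # True while we are just past a DEDENT run
    for tok in tokens:
        if tok[0] == "DEDENT":
            pending = True
        elif pending:
            yield ("NEWLINE", "\n") # separator between the closed block and tok
            pending = False
        yield tok
    # No NEWLINE is added at end of input: NEWLINE acts as a separator here,
    # not as a terminator (an assumption of this sketch).
```

The grammar can then treat INDENT ... DEDENT purely as a trailer on an expression and use NEWLINE as the only separator between expressions.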