views:

634

answers:

2

I've already written a generator that does the trick, but I'd like to know the best possible way to implement the off-side rule.

Shortly: Off-side rule means in this context that indentation is getting recognized as a syntactic element.

Here is the offside rule in pseudocode for making tokenizers that capture indentation in usable form, I don't want to limit answers by language:

token NEWLINE
    matches r"\n\ *"
    increase line count
    pick up and store the indentation level
    remember to also record the current level of parenthesis

procedure layout tokens
    level = stack of indentation levels
    push 0 to level
    last_newline = none
    per each token
        if it is NEWLINE put it to last_newline and get next token
        if last_newline contains something
            extract new_level and parenthesis_count from last_newline
            - if newline was inside parentheses, do nothing
            - if new_level > level.top
                push new_level to level
                emit last_newline as INDENT token and clear last_newline
            - if new_level == level.top
                emit last_newline and clear last_newline
            - otherwise
                while new_level < level.top
                    pop from level
                    if new_level > level.top
                        freak out, indentation is broken.
                    emit last_newline as DEDENT token
                clear last_newline
        emit token
    while level.top != 0
        emit token as DEDENT token
        pop from level

comments are ignored before they are getting into the layouter
layouter lies between a lexer and a parser

This layouter doesn't generate more than one NEWLINE at time, and doesn't generate NEWLINE when there's indentation coming up. Therefore parsing rules remain quite simple. It's pretty good I think but inform if there's better way of accomplishing it.

While using this for a while, I've noticed that after DEDENTs it may be nice to emit newline anyway, this way you can separate the expressions with NEWLINE while keeping the INDENT DEDENT as a trailer for expression.

+2  A: 

I've written tokenizers and parsers for a couple of little indentation-centric domain-specific languages in the past couple of years, and what you have there looks pretty reasonable to me, for whatever that's worth. If I'm not mistaken, your method is quite similar to what Python does, for example, which seems like it ought to carry some weight.

Converting NEWLINE NEWLINE INDENT to just INDENT before it hits the parser definitely seems like the right way to do things -- it's a pain (IME) to always be peeking ahead for that in the parser! I've actually done that step as a separate layer in what ended up being a three step process: the first combined what your lexer and layouter do minus all the NEWLINE lookahead stuff (which made it very simple), the second (also very simple) layer folded consecutive NEWLINEs and converted NEWLINE INDENT to just INDENT (or, actually, COLON NEWLINE INDENT to INDENT, since in this case all indented blocks were always preceded by colons), then the parser was the third stage on top of that. But it also makes a lot of sense to me to do things the way you've described them, especially if you want to separate the lexer from the layouter, which presumably you'd want to do if you were using a code-generation tool to make your lexer, for instance, as is common practice.

I did have one application that needed to be a bit more flexible about indentation rules, essentially leaving the parser to enforce them when needed -- the following needed to be valid in certain contexts, for instance:

this line introduces an indented block of literal text:
    this line of the block is indented four spaces
  but this line is only indented two spaces

which doesn't work terribly well with INDENT/DEDENT tokens, since you end up needing to generate one INDENT for each column of indentation and an equal number of DEDENTs on the way back, unless you look way ahead to figure out where the indent levels are going to end up being, which it doesn't seem like you'd want a tokenizer to do. In that case I tried a few different things and ended up just storing a counter in each NEWLINE token that gave the change in indentation (positive or negative) for the following logical line. (Each token also stored all trailing whitespace, in case it needed preserving; for NEWLINE, the stored whitespace included the EOL itself, any intervening blank lines, and the indentation on the following logical line.) No separate INDENT or DEDENT tokens at all. Getting the parser to deal with that was a bit more work than just nesting INDENTs and DEDENTs, and might well have been hell with a complicated grammar that needed a fancy parser generator, but it wasn't nearly as bad as I'd feared, either. Again, no need for the parser to look ahead from NEWLINE to see if there's an INDENT coming up in this scheme.

Still, I think you'd agree that allowing and preserving all manner of crazy-looking whitespace in the tokenizer/layouter and letting the parser decide what's a literal and what's code is a bit of an unusual requirement! You certainly wouldn't want your parser to be saddled with that indentation counter if you just wanted to be able to parse Python code, for example. The way you're doing things is almost certainly the right approach for your application and many others besides. Though if anyone else has thoughts on how best to do this sort of thing, I'd obviously love to hear them....

zaphod
+2  A: 

Ive been experimenting with this recently, and I came to the conclusion that, for my needs at least, I wanted the NEWLINES to mark the end of each "statement", whether it was the last statement in an indented block or not, i.e. I need the newlines even before DEDENT.

My solution was to turn it on its head, and instead of NEWLINES marking the end of lines, I use a LINE token to mark the start of a line.

I have a lexer that collapses empty lines (including comment-only lines) and emits a single LINE token with information about the indentation of the last line. Then my preprocessing function takes this token stream and adds INDENT or DEDENT "in between" any lines where the indentation changes. So

line1
    line2
    line3
line4

would give the token stream

LINE "line1" INDENT LINE "line2" LINE "line3" DEDENT LINE "line4" EOF

This allows me to write clear grammar productions for statements without worrying about detecting the end of statements even when they end with nested, indented, subblocks, something that can be hard if you are matching NEWLINES (and DEDENTS) instead.

Here is the core of the preprocessor, written in O'Caml:

  match next_token () with
      LINE indentation ->
        if indentation > !current_indentation then
          (
            Stack.push !current_indentation indentation_stack;
            current_indentation := indentation;
            INDENT
          )
        else if indentation < !current_indentation then
          (
            let prev = Stack.pop indentation_stack in
              if indentation > prev then
                (
                  current_indentation := indentation;
                  BAD_DEDENT
                )
              else
                (
                  current_indentation := prev;
                  DEDENT
                )
          )
        else (* indentation = !current_indentation *)
          let  token = remove_next_token () in
            if next_token () = EOF then
              remove_next_token ()
            else
              token
    | _ ->
        remove_next_token ()

I haven't added support for parentheses yet, but that should be a simple extension. It does, however avoid emitting a stray LINE at the end of the file.

dkagedal
Your code is not able to emit multiple DEDENTs, neither it considers dedent before EOF. It can be useful for something, but those things are more important than parenthesis support.
Cheery
Also, don't bother about special support for parentheses, you are going to just miss the best point, just like python does. The point of layouting is to allow you provide excellent multi-line syntax, it doesn't conflict with parenthesis, unless you aren't able to combine those two.
Cheery
My code does emit multiple DEDENT, so I think you are misreading it.But I agree that I would like something that looks more like Haskell rather than Python, so I need a new approach.
dkagedal
[I wrote one for fun in Haskell,][1] using Parsec, an embeded domain specific language. It's about 30 lines of code and fully commented. [1]: http://www.refactory.org/s/indentation_based_syntax_parser_tokenizer/view/latest
Christopher Done