ansaurus

Question

How would you go about implementing off-side rule?

Answer 1

+2 A:

I've written tokenizers and parsers for a couple of little indentation-centric domain-specific languages in the past couple of years, and what you have there looks pretty reasonable to me, for whatever that's worth. If I'm not mistaken, your method is quite similar to what Python does, for example, which seems like it ought to carry some weight.

Converting NEWLINE NEWLINE INDENT to just INDENT before it hits the parser definitely seems like the right way to do things -- it's a pain (IME) to always be peeking ahead for that in the parser! I've actually done that step as a separate layer in what ended up being a three step process: the first combined what your lexer and layouter do minus all the NEWLINE lookahead stuff (which made it very simple), the second (also very simple) layer folded consecutive NEWLINEs and converted NEWLINE INDENT to just INDENT (or, actually, COLON NEWLINE INDENT to INDENT, since in this case all indented blocks were always preceded by colons), then the parser was the third stage on top of that. But it also makes a lot of sense to me to do things the way you've described them, especially if you want to separate the lexer from the layouter, which presumably you'd want to do if you were using a code-generation tool to make your lexer, for instance, as is common practice.

I did have one application that needed to be a bit more flexible about indentation rules, essentially leaving the parser to enforce them when needed -- the following needed to be valid in certain contexts, for instance:

this line introduces an indented block of literal text:
    this line of the block is indented four spaces
  but this line is only indented two spaces

which doesn't work terribly well with INDENT/DEDENT tokens, since you end up needing to generate one INDENT for each column of indentation and an equal number of DEDENTs on the way back, unless you look way ahead to figure out where the indent levels are going to end up being, which it doesn't seem like you'd want a tokenizer to do. In that case I tried a few different things and ended up just storing a counter in each NEWLINE token that gave the change in indentation (positive or negative) for the following logical line. (Each token also stored all trailing whitespace, in case it needed preserving; for NEWLINE, the stored whitespace included the EOL itself, any intervening blank lines, and the indentation on the following logical line.) No separate INDENT or DEDENT tokens at all. Getting the parser to deal with that was a bit more work than just nesting INDENTs and DEDENTs, and might well have been hell with a complicated grammar that needed a fancy parser generator, but it wasn't nearly as bad as I'd feared, either. Again, no need for the parser to look ahead from NEWLINE to see if there's an INDENT coming up in this scheme.

Still, I think you'd agree that allowing and preserving all manner of crazy-looking whitespace in the tokenizer/layouter and letting the parser decide what's a literal and what's code is a bit of an unusual requirement! You certainly wouldn't want your parser to be saddled with that indentation counter if you just wanted to be able to parse Python code, for example. The way you're doing things is almost certainly the right approach for your application and many others besides. Though if anyone else has thoughts on how best to do this sort of thing, I'd obviously love to hear them....

zaphod 2008-11-03 08:59:09

Answer 2

+2 A:

Ive been experimenting with this recently, and I came to the conclusion that, for my needs at least, I wanted the NEWLINES to mark the end of each "statement", whether it was the last statement in an indented block or not, i.e. I need the newlines even before DEDENT.

My solution was to turn it on its head, and instead of NEWLINES marking the end of lines, I use a LINE token to mark the start of a line.

I have a lexer that collapses empty lines (including comment-only lines) and emits a single LINE token with information about the indentation of the last line. Then my preprocessing function takes this token stream and adds INDENT or DEDENT "in between" any lines where the indentation changes. So

line1
    line2
    line3
line4

would give the token stream

LINE "line1" INDENT LINE "line2" LINE "line3" DEDENT LINE "line4" EOF

This allows me to write clear grammar productions for statements without worrying about detecting the end of statements even when they end with nested, indented, subblocks, something that can be hard if you are matching NEWLINES (and DEDENTS) instead.

Here is the core of the preprocessor, written in O'Caml:

  match next_token () with
      LINE indentation ->
        if indentation > !current_indentation then
          (
            Stack.push !current_indentation indentation_stack;
            current_indentation := indentation;
            INDENT
          )
        else if indentation < !current_indentation then
          (
            let prev = Stack.pop indentation_stack in
              if indentation > prev then
                (
                  current_indentation := indentation;
                  BAD_DEDENT
                )
              else
                (
                  current_indentation := prev;
                  DEDENT
                )
          )
        else (* indentation = !current_indentation *)
          let  token = remove_next_token () in
            if next_token () = EOF then
              remove_next_token ()
            else
              token
    | _ ->
        remove_next_token ()

I haven't added support for parentheses yet, but that should be a simple extension. It does, however avoid emitting a stray LINE at the end of the file.

dkagedal 2009-06-03 18:34:32

Your code is not able to emit multiple DEDENTs, neither it considers dedent before EOF. It can be useful for something, but those things are more important than parenthesis support.

Cheery 2009-06-04 09:31:51

Also, don't bother about special support for parentheses, you are going to just miss the best point, just like python does. The point of layouting is to allow you provide excellent multi-line syntax, it doesn't conflict with parenthesis, unless you aren't able to combine those two.

Cheery 2009-06-04 09:53:35

My code does emit multiple DEDENT, so I think you are misreading it.But I agree that I would like something that looks more like Haskell rather than Python, so I need a new approach.

dkagedal 2009-06-14 21:50:09

[I wrote one for fun in Haskell,][1] using Parsec, an embeded domain specific language. It's about 30 lines of code and fully commented. [1]: http://www.refactory.org/s/indentation_based_syntax_parser_tokenizer/view/latest

Christopher Done 2009-09-19 18:58:50

ansaurus

tags:

views:

answers:

How would you go about implementing off-side rule?

related questions